Minimum Distribution Support Vector Clustering

Support vector clustering (SVC) is a boundary-based algorithm that has several advantages over other clustering methods, including the ability to identify clusters of arbitrary shape and number. Leveraging the high generalization ability of the large margin distribution machine (LDM) and the optimal margin distribution clustering (ODMC), we propose a new clustering method, minimum distribution support vector clustering (MDSVC), to improve the robustness of boundary point recognition. MDSVC characterizes the optimal hypersphere by first-order and second-order statistics and tries to minimize the mean and variance simultaneously. In addition, we prove theoretically that our algorithm can obtain better generalization performance. Some instructive insights for adjusting the number of support vector points are gained. For the optimization problem of MDSVC, we propose a dual coordinate descent algorithm for small and medium samples. Experimental results on both artificial and real datasets indicate that MDSVC achieves a significant improvement in generalization performance compared to SVC.


Introduction
Cluster analysis groups a dataset into clusters according to the correlations of the data. To date, many clustering algorithms have emerged, such as plane-based clustering algorithms, spectral clustering, the density-based DBSCAN [1] and OPTICS [2] algorithms, the Density Peak (DP) algorithm characterizing cluster centers [3], and the partition-based k-means algorithm [4]. In particular, the support vector machine (SVM) has become an important tool for data mining. As a classical machine learning algorithm, SVM handles the issues of local extrema and high data dimensionality well during model optimization, and it makes data separable in feature space through nonlinear transformations [5].
In particular, Tax and Duin proposed a novel method in which the decision boundaries are constructed by a set of support vectors, the so-called support vector domain description (SVDD) [6]. Building on kernel theory and SVDD, support vector clustering (SVC) was proposed based on contour clustering, and it has many advantages over other clustering algorithms [7]. SVC is robust to noise and does not require the number of clusters to be specified in advance. For SVC, it is feasible to adjust its parameter C to obtain better performance, but this comes at the cost of increasing outliers, and it only introduces a soft boundary for optimization. Several insights into understanding the features of SVC have been offered in [8,9]. After studying the relevant literature, we found that these insights mainly cover two aspects. The first aspect is the selection of the parameters q and C. Lee and Daniels chose a secant-like method to generate monotonically increasing sequences of q and establish a monotone function of q and the radius R, which can be applied in high dimensions. The second aspect is optimizing the cluster assignments. Considering the high cost of the second stage of SVC, several methods have been proposed for improving the cluster partition of SVC. First, Ben-Hur et al. improved the original Complete Graph (CG) partition by using an adjacency matrix partition based on SV points, which simplified the original calculation, but this method failed to avoid random sampling. Yang et al. elaborated the Proximity Graph (PG) to model the proximity structure of the m samples with a time complexity of O(m) or O(m log m). However, the complexity of this algorithm increases with dimensionality [10]. Lee et al. studied a cone cluster labeling (CCL) method that uses the geometry of the feature space to assign clusters in the data space: if two cones intersect, the samples in these cones belong to the same cluster [9].
However, the performance of CCL is sensitive to kernel parameter q for the cones decided by q. More recently, Peng et al. designed a partition method that utilized the clustering algorithm of similarity segmentation-based point sorting (CASS-PS) and considered the geometrical properties of support vectors in the feature space to avoid the downsides of SVC and CASS-PS [11]. However, CASS-PS is sensitive to the number and distribution of the support vector points recognized. Jennath and Asharaf proposed an efficient cluster assignment algorithm for SVC using the similarity of feature set for data points utilizing an efficient MEB approximation algorithm [12].
It is well known from the margin theory that maximizing the minimum margin is often not the best way to further improve the learning performance. Regarding this, introducing the margin mean and margin variance of the distribution can make a model achieve better generalization performance, as revealed by Gao and Zhou [13,14]. In classification and regression analysis, there are many methods that improve the learning performance by considering the statistical information of the data. Zhang and Zhou proposed the large margin distribution machine (LDM) and the optimal margin distribution machine (ODM) for data classification, which adjust the mean and variance to improve the performance of the model [15,16]. In regression analysis, MDR, ε-SVR, LDMR, and v-MADR consider the margin distribution to achieve better performance. MDR, proposed by Liu et al., minimizes the regression deviation mean and the regression deviation variance, introducing the statistics of the regression deviation into ε-SVR [17]. However, this is not very appropriate when both positive-label and negative-label samples are present. To deal with this issue, Wang et al. characterized the absolute regression deviation mean and the absolute regression deviation variance and proposed the v-minimum absolute deviation distribution regression (v-MADR) machine [18]. Inspired by LDM, Rastogi et al. also proposed a large margin distribution machine-based regression model (LDMR) [19].
In clustering analysis, for a good clustering whose labels are consistent with the clustering results, SVM can obtain a larger minimum margin. Inspired by this, maximum margin clustering (MMC) borrowed the large margin heuristic from SVM and applied the maximum margin over all possible labelings [20]. Improved versions of MMC have also been proposed [21]. The optimal margin distribution clustering (ODMC) proposed by Zhang et al. forms the optimal margin distribution during the clustering process, characterizing the margin distribution by the first- and second-order statistics. It also has the same convergence rate as state-of-the-art cutting-plane-based algorithms [22].
The success of the aforementioned models suggests that there may still be room for further improving SVC. These models do not address improving the generalization performance of SVC, that is, reconstructing the hyperplane when the distribution of the data in feature space is fixed. In this research, we propose a novel approach called minimum distribution support vector clustering (MDSVC), and our novel contributions are as follows:

•
We characterize the envelope radius of the minimum hypersphere by the first- and second-order statistics, i.e., the mean and variance. By minimizing these two statistics, we can, to some extent, avoid the problem of too many or too few support vector points caused by an inappropriate kernel width coefficient q, form a better cluster contour, and, thus, improve the accuracy.

•

We enhance the generalization ability and robustness of the algorithm by introducing these statistics while the distribution of the data is fixed for the given q in feature space.

•
We further prove that our method has better generalization performance, inspired by the expectation of the probability of test error proposed in SVDD.

•
We customize a dual coordinate descent (DCD) algorithm to optimize the objective function of MDSVC for our experiments.
The remainder of this paper is organized as follows. Section 2 introduces the notations, the recent progress in the margin theory, and the SVC algorithm. In Section 3, we present the MDSVC algorithm, which minimizes the mean and the variance, and propose a DCD algorithm to solve the objective function of MDSVC. Section 4 reports our experimental results on both artificial and real datasets. We discuss our method in Section 5 and draw conclusions in Section 6.

Background
Suppose D = [x_1, . . . , x_m] is a dataset of m samples, where each column is a d-dimensional sample vector. φ(x) is the mapping function induced by a kernel k, i.e., k(x, y) = φ(x)ᵀφ(y). Since we adopt the Gaussian kernel, we obviously have k(x, x) = 1. Both MDSVC and SVC aim to obtain the radius R of the sphere, the center a of the hypersphere, and the distance of each mapped point from the center in feature space. Formally, we denote by X the matrix whose i-th column is φ(x_i). In this paper, we use the Gaussian kernel as our nonlinear transformation to map data points to the feature space.
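As a concrete illustration of this setup, the following sketch (in Python with NumPy, used here purely for illustration; the function name is ours) builds the Gaussian kernel matrix for a data matrix D whose columns are samples. Its diagonal is identically k(x, x) = 1:

```python
import numpy as np

def gaussian_kernel_matrix(D, q):
    """Compute Q with Q[i, j] = exp(-q * ||x_i - x_j||^2).

    D has shape (d, m): each column is one d-dimensional sample,
    matching the paper's convention. The diagonal entries are
    k(x, x) = exp(0) = 1 for any q.
    """
    X = D.T                                   # (m, d): one sample per row
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-q * np.maximum(dist2, 0.0))  # clamp tiny negatives
```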

Recent Progress in Margin Theory
Recent margin theory indicates that maximizing the minimum margin may not lead to an optimal result or better generalization performance. In the SVC algorithm, once the kernel width coefficient q is selected, the distribution of the data points mapped to the feature space is determined. If the distribution of the boundary data differs from that of the internal data, the hyperplane constructed by SVC may not make good use of the data information, thus reducing the performance of SVC. Additionally, we note that in practice SVC often overfits, producing too many support vector points. Gao and Zhou have already demonstrated that the margin distribution is critical to generalization performance [13]. The high generalization ability of the margin distribution has been shown in v-MADR, which minimizes both the absolute regression deviation mean and the absolute regression deviation variance [18]. We also note that SVC can be regarded as a binary classifier divided by the division hyperplane. Inspired by the aforementioned research, we introduce the mean and variance of the margin distribution and minimize them to reduce the number of support vector points.
For the convenience of readers, a more detailed description of SVC is presented in Appendix A.

Minimum Distribution Support Vector Clustering
In this section, we delineate the process of MDSVC in three subsections: the formulation of MDSVC, which minimizes both the mean and the variance; the optimization algorithm based on the dual coordinate descent method; and the statistical property of MDSVC, which shows the upper bound of the expectation of error. In this research, as mentioned before, we take the Gaussian kernel as the nonlinear transformation to map data points to the feature space, and we then derive k(x, x) = 1, which is critical for simplifying the variance and solving the objective function. In addition, we define the mean and variance based on the Euclidean distance. The reason we employ the Euclidean distance is that it lets us treat the objective function as a convex quadratic programming problem, and the Euclidean norm represents the actual distance between two points rather than the distance on the surface.
We delineate the idea of our algorithm in the feature space in Figure 1 roughly, and more detailed descriptions are given in Sections 3.1.1 and 3.1.2. First, the hyperplanes of MDSVC, SVC, and the unit ball are shown in Figure 1a. By characterizing and minimizing our mean and variance, we can, thus, have the hypersphere of MDSVC as an inclined curved surface in the feature space, as indicated in red in Figure 1a. The intersection of the SVC's hypersphere and the unit sphere is a cap-like area. We further illustrate the main difference between MDSVC and SVC through a lateral view and top view, which are shown in Figure 1b,c, respectively. Figure 1b is the schematic diagram of the MDSVC's Cap and the SVC's Cap. We can find that the center a of MDSVC's hypersphere moves away from the center of the ball and inclines to the distribution of the overall data because of the mean and variance. In Figure 1c, we use Soft-Rsvc to represent the soft boundary of SVC. The centers of the three spheres, namely the unit ball, SVC's hypersphere, and MDSVC's hypersphere, are denoted by o, asvc, and a, respectively. We also use red points to indicate the SVs of MDSVC. As shown in Figure 1c, we can see how the boundary of MDSVC R is determined. Finally, we use Figure 1d to show the distribution of data points and the details of the Cap formed by SVC.

Preliminary
Let φ(x) be the mapping function induced by a kernel k, i.e., k(x, y) = φ(x)ᵀφ(y). In the feature space, we use the Gaussian kernel, so we derive k(x, x) = 1. The squared distance between a and x is ‖φ(x) − a‖², where ‖·‖ is the Euclidean norm and a is the center of the sphere. We denote by X the matrix whose i-th column is φ(x_i). In the rest of this subsection, we first give the definitions of the mean and variance statistics in clustering; we then present Theorems 1 and 2 to facilitate the formation of the variance; next, we employ the mean and variance (Equations (1) and (2)) to obtain and elucidate the final formula as a convex quadratic programming problem.
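Since the center lies in the span of the mapped samples (a = Xα, as Theorem 1 later establishes), the squared feature-space distance ‖φ(x) − a‖² can be evaluated purely through kernel values: it expands to k(x, x) − 2Σᵢ αᵢ k(xᵢ, x) + αᵀQα, and the first term equals 1 for the Gaussian kernel. A minimal Python sketch (function names are ours, for illustration):

```python
import numpy as np

def dist2_to_center(x, data, alpha, q):
    """Squared feature-space distance ||phi(x) - a||^2 with a = X alpha.

    Expands to k(x, x) - 2 * sum_i alpha_i k(x_i, x) + alpha^T Q alpha;
    with the Gaussian kernel the first term is exactly 1.
    """
    k = lambda u, v: np.exp(-q * np.sum((u - v) ** 2))
    kx = np.array([k(xi, x) for xi in data])            # k(x_i, x) values
    Q = np.array([[k(u, v) for v in data] for u in data])
    return 1.0 - 2.0 * kx @ alpha + alpha @ Q @ alpha
```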

Definition 1.
The margin mean is defined as follows:

d̄ = (1/m) Σᵢ₌₁ᵐ ‖φ(x_i) − a‖² = 1 + ‖a‖² − (2/m) eᵀXᵀa,   (1)

where e stands for the all-one column vector of m dimensions. Because we use the Gaussian kernel, we have k(x, x) = 1, which facilitates the calculation. The reason for choosing this form of the mean is that we incline to make the center of the MDSVC sphere close to the denser part of the samples. Next, we define the margin variance.

Definition 2.
The margin variance is defined as follows:

d̂ = (1/m) Σᵢ₌₁ᵐ (‖φ(x_i) − a‖² − d̄)²,   (2)

where d̄ is the margin mean of Equation (1). The variance considers the distribution of the overall data rather than the distribution of the SVs. Note that if we only characterized the mean in our method, the hyperplane would incline toward dense clusters, and more support vectors might appear because of the high density of those clusters, which would result in imbalance. However, the mean is just the first step in adjusting the sphere of MDSVC; we next introduce the variance to adjust the boundary with less volatility. We can see that the variance quantifies the scatter of the clustering. Additionally, we denote the kernel matrix Q = XᵀX, whose (i, j)-th entry is k(x_i, x_j). The variance in Equation (2) is difficult to handle directly due to its complicated form, so we have to use an alternative way to address this issue; thus, we use the following Theorem 1. Note that the formula of the variance can be further simplified, so we employ Theorem 2 to elucidate and facilitate the form of the variance. Finally, we obtain the simplified form of the margin variance as in Equation (8).
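Under our reading of Definitions 1 and 2 as the empirical mean and variance of the squared distances ‖φ(x_i) − a‖², these statistics reduce to pure kernel expressions once a = Xα. A hypothetical sketch (the function name and formulation are ours, stated under that assumption):

```python
import numpy as np

def margin_statistics(Q, alpha):
    """Empirical mean and variance of ||phi(x_i) - a||^2 over all samples.

    Assumes a = X alpha (the form given by Theorem 1) and a Gaussian
    kernel, so k(x_i, x_i) = 1 and each squared distance reduces to
    1 - 2 (Q alpha)_i + alpha^T Q alpha.
    """
    aQa = alpha @ Q @ alpha                 # ||a||^2 in kernel form
    d2 = 1.0 - 2.0 * (Q @ alpha) + aQa      # d_i^2 for i = 1..m
    return d2.mean(), d2.var()
```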
Theorem 1. The center a of the hypersphere can be represented as

a = Xα,   (3)

for some coefficient vector α ∈ ℝᵐ.

Proof of Theorem 1. Suppose that a can be decomposed into the span of the φ(x_i) and an orthogonal vector v, that is, a = Xα + v, where v satisfies φ(x_i)ᵀv = 0 for all i, i.e., Xᵀv = 0. Then we have

‖φ(x_i) − a‖² = ‖φ(x_i) − Xα‖² + ‖v‖².

Therefore, when minimizing over a, setting v = 0 does not affect the remaining terms. The formula of the mean is then derived as

d̄ = 1 + ‖Xα‖² + ‖v‖² − (2/m) eᵀXᵀXα,

so, in the optimization, the dependence of the mean on v reduces to the modulus ‖v‖², which is minimized at v = 0. For the variance, the term ‖v‖² appears identically in every squared distance and cancels in the deviations ‖φ(x_i) − a‖² − d̄; thus, the variance is independent of v. The rest of the optimization objectives are also independent of v. Based on all of the aforementioned equations, a can be represented in the form of Equation (3).
Theorem 2. Each of the matrices H and P is built from the columns of the kernel matrix Q; since Q is symmetric, H and P are both symmetric matrices. Deducing QG accordingly and applying Theorem 1, the mean and variance can be written in the simplified forms of Equation (8).

Minimizing the Mean and Variance
Referring to the above subsections, we define the formulation of MDSVC as the problem of minimizing, over R and a, the radius together with the weighted mean and variance terms (Equation (9)). Minimizing the mean draws the center a of the sphere closer to the denser part of the data in the feature space, and minimizing the variance term weighted by λ2 makes more points closer to a, resulting in fewer support vector points. Next, we simplify Equation (9).
Based on Theorem 1, Equation (9) leads to a problem over R and α (Equation (12)). By introducing Lagrange multipliers β_i and µ_i, the Lagrange function L(R, α, ξ, β, µ) of Equation (12) is given in Equation (13). Setting the partial derivatives with respect to {R, α, ξ} to zero to satisfy the KKT conditions, we obtain the corresponding stationarity equations. Thus, we adopt G = ((λ1 + 1)Q + H + P)⁻¹Q, where ((λ1 + 1)Q + H + P)⁻¹ refers to the inverse of (λ1 + 1)Q + H + P. On the basis of these equations, we obtain the vector α as in Equation (15). Substituting Equation (15) into Equation (13), and then substituting Equations (12)–(14) into Equation (11), Equation (11) is re-written in terms of β. Noticing that G = ((λ1 + 1)Q + H + P)⁻¹Q, the matrices D and F take the forms given in Equation (18). Referring to the above equations, we thus derive the final formulation of MDSVC as Equation (19). Based on Theorem 2, D is symmetric and consists of positive elements. We can then conclude that Equation (19) is a convex quadratic problem, resulting from the convex objective function and the convex domain β ∈ [0, C]. Thus, we can solve the objective function with convex quadratic programming.

The MDSVC Algorithm
Due to the simple box constraint and the convex quadratic objective of our optimization problem, we adopt the DCD algorithm, which minimizes over one variable at a time while keeping the other variables fixed, yielding a closed-form solution for each subproblem. For our problem, we adjust the value of β_i with a step size t to make f(β) reach its minimum value while keeping the other β_k (k ≠ i) unchanged. Our sub-problem is thus Equation (20), where e_i = (0, . . . , 1, . . . , 0)ᵀ ∈ ℝᵐ denotes the vector with 1 in the i-th element and 0 elsewhere. For the function f, we expand f(β + t e_i), where d_ii = e_iᵀ D e_i is the i-th diagonal entry of D, and then calculate the gradient accordingly. As f(β) is independent of t, we can consider Equation (21) as a function of t alone; hence, f(β + t e_i) can be transformed into a simple quadratic function of t. Thus, we obtain the minimizer of Equation (21) by setting its derivative with respect to t to zero, so t is represented as in Equation (23). We denote by β_i^iter the value of β_i at the iter-th iteration; the value of β_i^(iter+1) is then obtained as in Equation (24). Considering the box constraint 0 ≤ β_i ≤ C of the problem, we can further obtain the final update rule for β_i in Equation (25). According to Equations (16) and (19), we have [∇f(β)]_i = 2e_iᵀQα. Algorithm 1 (MDSVC) describes the procedure of MDSVC with the Gaussian kernel, whose output is α and β.
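The coordinate update above can be sketched as follows. We assume the generic convex quadratic form min_β βᵀDβ + Fᵀβ subject to 0 ≤ β_i ≤ C as a stand-in for Equation (19); the per-coordinate step, t = −[∇f(β)]_i/(2d_ii) followed by projection onto [0, C], mirrors Equations (23)–(25):

```python
import numpy as np

def dcd_solve(D, F, C, max_iter=1000, tol=1e-8):
    """Dual coordinate descent for min_beta beta^T D beta + F^T beta
    subject to 0 <= beta_i <= C (a sketch; the exact objective of
    Equation (19) is assumed to have this generic convex form).
    """
    m = D.shape[0]
    beta = np.zeros(m)
    for _ in range(max_iter):
        moved = 0.0
        for i in range(m):
            grad_i = 2.0 * D[i] @ beta + F[i]        # [grad f(beta)]_i
            d_ii = D[i, i]
            if d_ii <= 0:
                continue                             # skip degenerate coordinate
            t = -grad_i / (2.0 * d_ii)               # unconstrained minimizer
            new_bi = min(max(beta[i] + t, 0.0), C)   # project onto the box [0, C]
            moved = max(moved, abs(new_bi - beta[i]))
            beta[i] = new_bi
        if moved < tol:                              # converged: no coordinate moved
            break
    return beta
```

Each inner step solves one coordinate exactly, so the objective is non-increasing and the box constraint holds throughout.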
Meanwhile, we give an analysis of the computational complexity of the MDSVC algorithm, where m denotes the number of examples and n represents the number of features. We set maxIter to 1000 during our experiments; the time complexity of DCD can thus be cast as O(maxIter · m²). Furthermore, the total time complexity of DCD in this paper is the sum of the per-step complexities shown in Table 1. Considering that m is much greater than n, the time complexity of DCD is O(m³), and the space complexity of DCD is O(m²).

The Properties of MDSVC
We briefly introduce the properties of MDSVC in this subsection. Hereinafter, the points with 0 < β_i < C will be referred to as support vectors (SVs), and the points with β_i = C will be called bounded support vectors (BSVs), the same as in SVC. Additionally, SVDD [6] used leave-one-out cross-validation as the criterion to characterize the expectation of the probability of test error, described as follows:

E[P(error)] = num(SV)/m.   (26)

The above expectation is more suitable as a standard for adjusting the parameters in the experiments of SVDD than as a theoretical basis, and it can only estimate errors of the first kind, i.e., on the target class. By analyzing the above equation, we further infer that our algorithm can reduce the number of SVs to some extent compared with SVC; thus, we can theoretically obtain better generalization performance than SVC. Inspired by SVDD and LDM, we give the expectation in a manner similar to the approach used in LDM.

Theorem 3. Let β* represent the optimal solution of Equation (19) and E[R(β*)] be the expectation of the probability of error; then we obtain the bound in Equation (27).

Proof of Theorem 3. Suppose that β* is the optimal solution of Equation (19) and that the parameters of the sphere are R and a, respectively. As in [16], the expectation is calculated from γ((x_1, y_1), . . . , (x_m, y_m)), the number of errors produced during the leave-one-out procedure. Data points are divided into three categories. Note that if β_i* = 0, the point is an interior point in the data space; based on the analysis of SVDD, the cluster of an interior point is determined entirely by the SVs, regardless of the assignment of the cluster in the second stage of the MDSVC procedure. Hence, we consider two cases, as follows. (1) 0 < β_i* < C: the point is a support point. According to the SVC and KKT conditions, we have a bound involving e_i, where e_i is the vector with 1 in the i-th coordinate and 0 elsewhere. Incorporating Equation (16) into this formula, we have φ(x_i)ᵀa ≤ β_i* d_ii/2, where the x_i are SVs. Further, note that if x_i is an SV, we have φ(x_i)ᵀa = ‖a‖² = 1 − R², which is a lemma proposed in CCL [9]. Thus, rearranging φ(x_i)ᵀa ≤ β_i* d_ii/2, we obtain 1 ≤ β_i* d_ii/(2(1 − R²)). (2) β_i* = C: x_i is a bounded SV (BSV) and must be misclassified in the leave-one-out procedure; hence, each such point contributes one error to γ((x_1, y_1), . . . , (x_m, y_m)). Combining the two cases yields the bound of Theorem 3.

Experimental Study
In this section, MDSVC is compared with k-means (KM) [4], optimal margin distribution clustering (ODMC) [22], spectral clustering (SC) [23], mean shift (MS) [24], and hierarchical clustering (HC) [25]. We adopt the results of k-means as a baseline rather than maximum margin clustering (MMC) [20], since the latter could not return results in a reasonable time for most datasets. We experimentally evaluate the performance of our MDSVC against the original SVC algorithm on classic artificial datasets and several medium-sized datasets; that is, we focus on the difference between MDSVC and SVC. Table 2 summarizes the statistics of these datasets. All real-world datasets used for our experiments can be found at UCI (http://archive.ics.uci.edu/ml, 2 February 2021). In Table 2, all samples of the artificial datasets, namely convex, dbmoon, and ring, are perturbed with Gaussian noise; these are representative of different types of datasets. All algorithms are implemented with MATLAB R2021a on a PC with a 2.50 GHz CPU and 64 GB memory.

Evaluation Criteria
To evaluate the performance of MDSVC, we use two external indicators, clustering accuracy (Acc) and Adjusted Rand Index (ARI), as our performance metrics. Table 3 shows the definition of the metrics mentioned. Table 3. Formula of metrics.

Metrics Definition
Accuracy: m is the total number of samples, and c_i represents the number of points in the i-th cluster that are classified correctly. We predict the r clusters by performing the clustering methods and then measure the accuracy, Acc = (Σᵢ₌₁ʳ c_i)/m, according to the true labels.
Adjusted Rand Index: [y_1, y_2, . . . , y_s] stands for the true labels of the dataset, while [c_1, c_2, . . . , c_r] stands for the clusters separated by MDSVC. The sum of TP and TN represents the consistency between the clustering result and the original cluster labels, and it can be computed directly from the confusion matrix. The Rand index (RI), which equals (TP + TN)/C(m, 2), represents the frequency of agreements over all instance pairs, from which we can calculate the RI value. However, the RI value is not close to zero for two random label assignments. The ARI, which discounts the expected RI of a random partition, addresses this issue.
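The ARI computation described above can be sketched directly from the contingency table (a standard formulation, not code from the paper):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Compute the ARI from the contingency table of two labelings.

    The RI counts agreeing pairs; the ARI subtracts the RI expected
    under random partitions, so random labelings score near zero and
    a perfect match (up to label permutation) scores 1.
    """
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency cells n_ij
    a = Counter(labels_true)                        # row sums
    b = Counter(labels_pred)                        # column sums
    index = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    return (index - expected) / (max_index - expected)
```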

Experimental Results and Analysis
In the process of SVC tuning, we noticed that there are often too many or too few SVs, failing to form a better contour. An irrational set of SVs may fail to divide the clusters well or to obtain higher precision. Based on this observation, we design experiments on the number of SVs with varying values of λ1 and q. As mentioned before, the Gaussian kernel k(x, y) = exp(−q‖x − y‖²) is employed for nonlinear clustering, from which we derive k(x, x) = 1. We apply the commonly used dichotomy method to select the kernel width coefficient q.
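The dichotomy (bisection) selection of q can be sketched as follows, under the assumption that the fraction of samples recognized as SVs grows monotonically with q; the callable `sv_fraction` and the target SV level are our illustrative assumptions, not part of the paper:

```python
def dichotomy_select_q(q_lo, q_hi, sv_fraction, target=0.1, tol=1e-3):
    """Bisection ('dichotomy') search for the kernel width q.

    sv_fraction(q) is any callable returning the fraction of samples
    recognized as SVs at width q, assumed monotone increasing in q.
    The search stops once the bracket is narrower than tol.
    """
    while q_hi - q_lo > tol:
        q_mid = 0.5 * (q_lo + q_hi)
        if sv_fraction(q_mid) < target:
            q_lo = q_mid      # too few SVs: increase q
        else:
            q_hi = q_mid      # too many SVs: decrease q
    return 0.5 * (q_lo + q_hi)
```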
Before conducting the experiments evaluating MDSVC against the other clustering methods, we analyze the relationship between λ1 and λ2 with respect to the number of SVs on two artificial datasets and two real datasets in Figure 2. Within the appropriate range of these two parameters, the number of SVs increases as λ1 increases, since a moves closer to the denser data in the feature space. Furthermore, an increase in λ2 leads to a decrease in the number of SVs, owing to less volatility in the distances from a, because the sphere settles in the right place with fewer SVs. Thus, it is instructive to adjust λ1 and λ2 to solve the problem of too many or too few SVs when q and C are given.
We show the results with respect to the corresponding performance metrics in Tables 4 and 5, where PERCENTAGE represents the ratio of the average number of SVs to the total number of data points. We use "/" to indicate that a method has no need to compute the PERCENTAGE. We summarize the win/tie/loss counts for MDSVC compared with the other methods in the last row. For a clearer comparison between MDSVC and SVC, q is selected from the same range [2⁻⁷, 2⁷] to compute the PERCENTAGE.
In particular, the evaluation on the datasets is shown in Tables 4 and 5. Table 4 shows that MDSVC is almost on par with SVC on the artificial datasets. It is worth noting that our MDSVC can significantly reduce the number of SVs under the same conditions as SVC, i.e., the same q and C. In Table 5, although both SVC and MDSVC have worse Acc or ARI on some datasets, MDSVC still obtains better results than SVC and the other methods on most real datasets. Based on the analysis of the experiments, we conclude that, once the parameters q and C are selected for MDSVC, we can adjust the SVs through the other parameters, λ1 and λ2, to achieve better performance. In addition, in terms of CPU time, MDSVC outperforms SVC on the datasets with higher dimensions and larger size (ring, vehicle), as shown in Figure 3. Referring to the comparison of CPU time between MDSVC and SVC, MDSVC thus has two advantages: better performance and less running time.
The estimated clustering assignments on the artificial datasets, convex and ring, are shown in Figure 4. To show the clusters divided by the SVs more intuitively and accurately, we draw the contour lines determined by R. We note that the SVC algorithm almost always overfits on the artificial datasets when the boundary is optimal; that is, all data points are identified as SVs, and, thus, Figure 4 only shows the best non-overfitting result of SVC. Obviously, MDSVC is superior to SVC in terms of forming better boundaries on the artificial datasets.
Considering Figure 4a–d, the boundaries of the convex and dbmoon sets formed by MDSVC are more rational than those of SVC in terms of separating clusters. For the ring set, the challenge for SVC is to form rational boundaries with an appropriate number of SVs. MDSVC forms four more rational boundaries and, thus, separates the ring set into two clusters, as shown in Figure 4e, while SVC recognizes only two boundaries in Figure 4f. Moreover, the introduction of the (non-negative) statistical terms, which draws the hyperplane closer to the denser part in the feature space, results in a value of R larger than that of SVC. Therefore, we obtain a larger boundary without increasing the number of outliers. In summary, MDSVC obtains better boundaries and a better presentation of the statistical information on the above datasets.

For further evaluation, we assess the impact of the parameters on ARI, Acc, and PERCENTAGE, as changes in parameter values may have a significant influence on the clustering results. PERCENTAGE characterizes the level of SVs. For our MDSVC, there are three trade-off parameters, λ1, λ2, and C, and the kernel parameter q. We show the impact of λ1 on ARI, Acc, and PERCENTAGE by varying it from 2⁻⁵ to 2⁵ while keeping the other parameters fixed at their optimal values. As one can see from Figure 5e–h, the number of SVs increases as λ1 increases.

Discussion
It has been proven that the trade-off parameters q and C have a significant impact on the results of SVC [5,7]. Obviously, we may spend considerable time finding the optimal parameters that characterize a better cluster boundary for SVC, and the tuning process may produce a large number of SVs, which can adversely affect the partition of the clusters. We know that it is feasible to adjust the parameter C to obtain better performance, but this comes at the cost of increasing outliers. To solve these problems, and inspired by the margin theory, we reconstruct a new hypersphere to identify the clusters, making denser sets more easily divided by employing the margin distribution, and we then establish the corresponding theory. We circumvent the high complexity resulting from the variance by proving Theorem 1 and employing the Gaussian kernel, and we then derive the convex optimization problem.
For the MDSVC algorithm, we design a customized DCD method to solve the convex optimization problem [25]. MDSVC has two additional trade-off parameters compared to SVC, namely $\lambda_1$ and $\lambda_2$. We demonstrate that both of them play an important role in MDSVC through the experiments shown in Figure 2 and the equations for the hypersphere derived in Section 2. From Figure 4, we can obtain some instructive insights for adjusting the number of SVs: we can obtain better performance by increasing the value of $\lambda_1$ when there are few SVs, and we can increase the value of $\lambda_2$ to reduce the number of SVs. If one focuses on forming better cluster outlines, the recommendation is to keep the ratio of $\lambda_1$ to $\lambda_2$ between $10^{-2}$ and $10^{2}$. Once the number of SVs changes drastically, there is no need to increase $\lambda_1$ and $\lambda_2$ further. Meanwhile, note that $\lambda_1$ should not be zero. We further prove theoretically in Section 3 that the error has an upper bound. Due to the lack of prior knowledge (true labels) in clustering, it is difficult to obtain our error bound in a manner similar to the approach used in LDM; we achieve it by taking advantage of the error proposed in SVDD [6] and the lemma derived in CCL [9]. According to Figure 1b,c and Figure 4c-e, minimizing the mean and variance makes datasets properly outlined with a proper number of SVs from both practical and theoretical perspectives, while the outlines of SVC are inappropriate. However, we found that our method performs only moderately when the edge points of a dataset are distributed relatively densely, where edge points are a collection of relatively sparsely distributed points in the data space. Based on the experiments and the derived formulas, we therefore conclude that our method performs better on datasets whose edge points are sparsely dispersed.
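The paper's customized DCD solver is not reproduced in this section, but its structure can be illustrated on a generic box-constrained quadratic program of the form min over beta of (1/2) beta' Q beta - p' beta with 0 <= beta_i <= C, the shape MDSVC's dual shares. The sketch below is a hedged illustration, not the authors' implementation: the function name `dcd` and the stopping rule are our own, and the closed-form single-coordinate update with incremental gradient maintenance is the standard dual coordinate descent recipe.

```python
import numpy as np

def dcd(Q, p, C, sweeps=200, tol=1e-9):
    """Dual coordinate descent for min_b 0.5*b^T Q b - p^T b, s.t. 0 <= b_i <= C.

    Illustrative sketch of the DCD scheme; Q must be symmetric PSD.
    """
    m = len(p)
    beta = np.zeros(m)
    grad = -np.asarray(p, dtype=float).copy()  # gradient Q beta - p at beta = 0
    for _ in range(sweeps):
        largest_step = 0.0
        for i in range(m):
            if Q[i, i] <= 0.0:
                continue
            # closed-form minimizer along coordinate i, clipped to the box [0, C]
            new_bi = min(max(beta[i] - grad[i] / Q[i, i], 0.0), C)
            step = new_bi - beta[i]
            if step != 0.0:
                grad += step * Q[:, i]  # keep the full gradient consistent in O(m)
                beta[i] = new_bi
                largest_step = max(largest_step, abs(step))
        if largest_step < tol:  # converged: no coordinate moved noticeably
            break
    return beta
```

Each coordinate update is exact and costs O(m) thanks to the incremental gradient, which is what makes DCD attractive for the small and medium sample sizes targeted by MDSVC.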
In short, the novel contribution of our work is that we redefine the hypersphere and its center in the feature space by considering the distribution of the data, forming better boundaries with a proper number of SVs. Furthermore, experimental results on most datasets indicate that MDSVC achieves better performance, which further demonstrates the superiority of our method. In the future, we will design a corresponding method that redefines the cluster partition to further improve performance.

Conclusions
In this research, we propose MDSVC, which employs the mean and variance of the margin distribution, leveraging margin theory and SVM. The novelty of MDSVC lies in its reconstruction of the hypersphere, the reduction in the number of support vector points compared to SVC under the same conditions, and the improvement in generalization performance. We have also proven theoretically that the generalization performance is improved and that the error has an upper bound. To optimize the objective function of MDSVC, we employ the DCD method, which has high applicability and efficiency. Experimental results on most datasets show that MDSVC achieves better performance, which indicates its superiority.
In our future work, we will study the partition of the second stage to further improve the performance of our method. At the same time, to assess the application potential of our algorithm, we will apply our model to more application scenarios.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Support Vector Clustering
Support vector clustering (SVC) introduces a soft boundary as a tolerance mechanism to reduce the number of boundary support vector points. The algorithm is robust to noise and does not need to know the number of clusters in advance. However, its effectiveness depends on the selection of the kernel width coefficient q and the soft boundary constant C; clearly, parameter adjustment is time-consuming. SVC has the following formulation:
$$\min_{R,\,a,\,\xi}\; R^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad \|\phi(x_i)-a\|^2 \le R^2+\xi_i,\;\; \xi_i \ge 0,\;\; i=1,\dots,m,$$
where parameter C is used for controlling outliers, $C\sum_{i=1}^{m}\xi_i$ is a penalty term, and the slack variables $\xi_i$ act as tolerances. SVC looks for the smallest enclosing hypersphere of radius R and center a, where $\|\cdot\|$ is the Euclidean norm. We can use the Lagrange function to solve the problem:
$$L = R^2 - \sum_{i}\beta_i\big(R^2+\xi_i-\|\phi(x_i)-a\|^2\big) - \sum_{i}\mu_i\xi_i + C\sum_{i}\xi_i,$$
with Lagrange multipliers $\beta_i \ge 0$ and $\mu_i \ge 0$. After taking the derivatives of the above formula with respect to R, a, and $\xi_i$ and setting them to zero, the dual problem can be cast as
$$\max_{\beta}\; \sum_{i}\beta_i K(x_i,x_i) - \sum_{i,j}\beta_i\beta_j K(x_i,x_j) \quad \text{s.t.}\quad 0 \le \beta_i \le C,\;\; \sum_{i}\beta_i = 1,$$
where $K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$. Thus, we can define the squared distance of each point to the center in the feature space:
$$d^2(x) = \|\phi(x)-a\|^2 = K(x,x) - 2\sum_{i}\beta_i K(x_i,x) + \sum_{i,j}\beta_i\beta_j K(x_i,x_j).$$
Finally, the radius of the hypersphere is $R = d(x_i)$ for any $x_i$ whose Lagrange multiplier satisfies $\beta_i \in (0, C)$; such an $x_i$ is a support vector (SV). A point is a boundary support vector (BSV) when $\beta_i = C$. SVC uses the adjacency matrix $A_{ij}$ to identify the connected components: for two points $x_i$ and $x_j$,
$$A_{ij} = \begin{cases} 1, & \text{if } d(y) \le R \text{ for all } y \text{ on the line segment connecting } x_i \text{ and } x_j,\\ 0, & \text{otherwise.} \end{cases}$$
Finally, the clusters are defined according to the adjacency matrix $A_{ij}$. The time complexity of computing the adjacency matrix is $O(vm^2)$, where v is the number of samples taken along each line segment. The quadratic programming problem can be solved by the SMO algorithm, whose memory requirements are low; it can be implemented using O(1) memory at the cost of a decrease in efficiency. The obvious shortcoming of SVC lies in the high cost of the partition stage.
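The appendix's pipeline (solve the dual, compute the feature-space distance, then label clusters through the adjacency matrix) can be sketched end to end. This is a hedged illustration, not the paper's code: we solve the dual by projected gradient ascent rather than SMO, the helper names (`svc_fit`, `svc_labels`, `project`) are our own, and the tolerance constants are arbitrary choices for the sketch.

```python
import numpy as np

def rbf(A, B, q):
    # Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-q * np.maximum(d2, 0.0))

def project(v, C, iters=80):
    # Euclidean projection onto {b : 0 <= b_i <= C, sum(b) = 1} via bisection
    lo, hi = v.min() - 1.0, v.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, C).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, C)

def svc_fit(X, q, C, steps=2000, lr=0.05):
    # Projected-gradient ascent on the SVC dual:
    #   max_b  sum_i b_i K_ii - b^T K b   s.t.  0 <= b_i <= C, sum(b) = 1
    K = rbf(X, X, q)
    beta = np.full(len(X), 1.0 / len(X))
    for _ in range(steps):
        beta = project(beta + lr * (np.diag(K) - 2.0 * K @ beta), C)
    return beta, K

def dist2(Y, X, beta, q, K):
    # d^2(y) = K(y, y) - 2 sum_i b_i K(x_i, y) + b^T K b; K(y, y) = 1 for the Gaussian
    return 1.0 - 2.0 * rbf(Y, X, q) @ beta + beta @ K @ beta

def svc_labels(X, beta, q, K, C, v=12, eps=1e-3):
    # R^2 is d^2(x_i) at SVs with 0 < beta_i < C (median for numerical robustness)
    sv = np.where((beta > 1e-6) & (beta < C - 1e-6))[0]
    sv = sv if len(sv) else np.arange(len(X))
    R2 = np.median(dist2(X[sv], X, beta, q, K))
    m = len(X)
    adj = np.eye(m, dtype=bool)
    ts = np.linspace(0.0, 1.0, v)[:, None]
    for i in range(m):
        for j in range(i + 1, m):
            seg = X[i] + ts * (X[j] - X[i])  # v samples on the line segment
            adj[i, j] = adj[j, i] = np.all(dist2(seg, X, beta, q, K) <= R2 + eps)
    # connected components of the adjacency graph define the clusters
    lab, nxt = -np.ones(m, dtype=int), 0
    for s in range(m):
        if lab[s] < 0:
            stack, lab[s] = [s], nxt
            while stack:
                u = stack.pop()
                for w in np.where(adj[u] & (lab < 0))[0]:
                    lab[w] = nxt
                    stack.append(w)
            nxt += 1
    return lab
```

The nested loop over pairs reproduces the $O(vm^2)$ adjacency cost noted above, which is exactly the second-stage expense that motivates the cheaper partition schemes (adjacency over SVs only, PG, CCL) discussed in the Introduction.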