Estimating the Major Cluster by Mean-Shift with Updating Kernel

The mean-shift method is a convenient mode-seeking method. Using a principle of the sample mean over an analysis window, or kernel, in a data space where samples are distributed with bias toward the densest direction of sample from the kernel center, the mean-shift method is an attempt to seek the densest point of samples, or the sample mode, iteratively. A smaller kernel leads to convergence to a local mode that appears because of statistical fluctuation. A larger kernel leads to estimation of a biased mode affected by other clusters, abnormal values, or outliers if they exist other than in the major cluster. Therefore, optimal selection of the kernel size, which is designated as the bandwidth in many reports of the literature, represents an important problem. As described herein, assuming that the major cluster follows a Gaussian probability density distribution, and, assuming that the outliers do not affect the sample mode of the major cluster, and, by adopting a Gaussian kernel, we propose a new mean-shift by which both the mean vector and covariance matrix of the major cluster are estimated in each iteration. Subsequently, the kernel size and shape are updated adaptively. Numerical experiments indicate that the mean vector, covariance matrix, and the number of samples of the major cluster can be estimated stably. Because the kernel shape can be adjusted not only to an isotropic shape but also to an anisotropic shape according to the sample distribution, the proposed method has higher estimation precision than the general mean-shift.


Introduction
When measuring a certain physical quantity, a few abnormal values, hereinafter designated as outliers, are included among the normal measured values, thereby exacerbating measurement noise. Frequently in science and engineering, some effort is necessary to estimate the true statistical parameters of the physical quantity from these measured values and the included outliers. Because the measurement noise generally follows a Gaussian distribution with mean zero, all samples from the major cluster are Gaussian-distributed around the true value. The problem described above is summarized to estimate the parameters of the major cluster, such as the mean, covariance matrix, and the number of samples included in the major cluster.
Because the mean equals the mode in a Gaussian distribution, if the outliers do not affect the sample mode of the major cluster, then the problem above can be replaced by a mode-seeking problem of the major cluster. Fukunaga and Hostetler [1] first proposed the mean-shift method, which was subsequently generalized by Cheng [2]. It is therefore known as a convenient iterative method for mode-seeking. The mean-shift was shown to be equivalent to the method that seeks a local maximum by the steepest gradient algorithm for the probability density distribution estimated using the kernel 1. Letting the mean µ x of sample x n , n = 1, . . . , N be the initial value of the mean estimatorμ N of major cluster, thenμ 2. Consider a Gaussian distribution p(x; µ W , σ W ) with the mean µ W and standard deviation σ W as the kernel function in the value direction. Here, the mean µ W of kernel function is found by the mean estimator of major cluster The standard deviation σ W is assigned to be an appropriate size as discussed later in Section 2.2. 3. Weight a n , n = 1, . . . , N for each sample x n , n = 1, . . . , N weighted by such a Gaussian kernel is a n = 1 A p(x n ; µ W , σ W ).
However, A in Equation (3) above is a normalization coefficient for which the sum of the weight a n is equal to 1, as We use this weight a n to calculate the sample mean µ x with x n , n = 1, . . . , N as a n x n .
4. The value of mean estimatorμ N of the major cluster is updated by the following equation: 5. If the variation of the value of mean estimatorμ N is equal to or less than the predetermined fixed value, then the update process is terminated. Otherwise, return to 2 and repeat the iteration.

Shortcomings and Solution of the General Mean-Shift Method
The general mean-shift method estimates the modes of the underlying probability density function. From the definition of a probability density, if the random variable X of N data points x i , i = 1, 2, 3, . . . , N in one-dimensional space R has density f , then For any given h (bin bandwidth or kernel bandwidth), we can estimate P(x − h < X < x + h) by the proportion of the sample falling in the interval (x − h, x + h). Thus, a natural estimatorf of the density is given by choosing a small h and settingf Here, N x denotes the number of samples falling in the interval (x − h, x + h). To express the estimator more transparently, define the weight function ω(x; h) by The estimator can be expressed as below [17]: Replace the weight function ω by a general kernel function K(x; σ) with standard deviation σ, which satisfies the condition and the kernel estimator for the probability density functionf (x) at point x can be expressed aŝ The general mean-shift is an attempt to ascertain the local modes of density functionf (x), which correspond to the zeros of the gradient ▽ xf (x) = 0. Therefore, the type of kernel function K(x; σ) and the kernel bandwidth σ both directly affect the performance of general mean-shift method. Fixing the type of kernel function to Gaussian kernel, we specifically examine the influence of the pre-set of the kernel bandwidth in general mean-shift.
To confirm the influence of fixed kernel bandwidth on estimation accuracy in a general mean-shift method, we set various fixed kernel bandwidths in advance. Here, we summarize the numerical and experimentally obtained results for general mean-shift method as discussed in Section 4. Figure 1a presents the bias error between the estimated value in a general mean-shift method and the true value when we select various kernel bandwidths in advance. The horizontal axis shows a selection of different kernel bandwidths. The vertical axes respectively show the bias error between the estimated value for the mean and the true mean value, and the variance of the mean value. While selecting different fixed kernel bandwidths, we estimated the mean of the major cluster, which is distributed as shown in Figure 2 for 1000 trials. Furthermore, we computed the bias errors using the equation described in Section 4.2. Figure 1a shows that, when we enlarge the fixed kernel bandwidth, the mean estimator is more susceptible to outliers. The bias error in general mean-shift method increases. Otherwise, when we decrease the kernel bandwidth, the number of samples involved in the mean estimation decreases. The local mode can easily become the convergence point of the iterative process. In addition, the bias error in general mean-shift method increases. The kernel bandwidth should be set in the range of 0.5-1.5. As shown in Figure 1b, with enlargement of the kernel bandwidth, the estimation variance in general mean-shift method decreases. Therefore, the optimal kernel bandwidth is 1.5. Because the maximum value of these variances is very small and, because it does not exceed 0.06, if we select the kernel bandwidth within this range of 0.5-1.5, we can ensure the unbiasedness and consistency of the mean estimator in general mean-shift method. However, not knowing the true mean of the major cluster beforehand, we cannot calculate the bias error in general mean-shift method. Therefore, we cannot choose the appropriate kernel bandwidth based on the comparison result shown in Figure 1a. Indeed, the proper pre-set of the kernel bandwidth constitutes an important difficulty. The optimal kernel bandwidth depends on the existence range of outliers, the number of samples belonging to the major cluster and the distribution that the major cluster follows. In the absence of prior knowledge, the kernel bandwidth is often fixed as appropriate to 1/2 the time of the standard deviation of the whole sample when the whole sample contains the major cluster and the outliers in signal processing [18]. For clustering in image processing or other multiple applications, it is still difficult to preset the kernel bandwidth properly in a general mean-shift method. When the kernel bandwidth is inappropriate, the kernel bandwidth becomes a factor that degrades the estimation accuracy.
As follows, based on the general mean-shift method, we propose a method to change the kernel bandwidth adaptively in accordance with simultaneous estimation of the mean (for a multi-dimensional case, the mean vector) and the standard deviation (for a multi-dimensional case, the covariance matrix) of a major cluster at each iteration. We need not set the kernel bandwidth properly in advance.

Derivation of Major Cluster Standard Deviation σ N from Sample Standard Deviation σ x
Here, the Gaussian distribution with mean µ and standard deviation σ is represented by p(x; µ, σ). It is abbreviated as p(x; σ) especially for µ = 0. We use the two following equations for the two Gaussian distributions: We assume that the influence of outliers is small such that the sample mode is not biased from the mean µ N . If the general mean-shift method with the sufficiently small fixed kernel bandwidth decided by the standard deviation of the kernel starts the iteration from an appropriate initial value, then the influence of the outliers on estimation decreases gradually as the estimate converges. Therefore, it is sufficient to consider only the samples from the major cluster x n , n = 1, . . . , N N when the estimate converges to their true value. In addition, the mean µ N of the major cluster and the mean µ W of the Gaussian kernel coincide near the convergence point. Even if coordinate transformation is performed so that both are 0, generality is not lost. Therefore, we let µ N = µ W = 0 here for analysis. The variance σ 2 x of the sample x n , n = 1, . . . , N N weighted by a n , n = 1, . . . , N N is Weight a n is a Gaussian kernel given by Equations (3) and (4). In addition, N is replaced by N N . The expected value of the sample variance σ 2 x is calculated after substituting Equation (3) into Equation (15) as The variance of 1 A is sufficiently smaller than the dispersion of other parts. Therefore, it can be approximated to the following equation based on the assumption that the major cluster follows a Gaussian distribution, as The approximation is discussed later in Appendix B. Here, we calculate the expected value of A by Equations (4) and (13) as The expected value of other part becomes according to Equation (14). In other words, after being weighted by a Gaussian kernel with mean 0 and standard deviation σ W , the expected value of variance σ 2 x of the sample which follows a Gaussian distribution with mean 0 and standard deviation σ N is according to Equations (18) and (19). Equation (20) above can be transformed to This expression shows that standard deviation σ N can be estimated from the standard deviation σ x of the sample, which is weighted using a Gaussian kernel with mean 0 and standard deviation σ W asσ In addition, using Equation (18), the number N N of samples belonging to the major cluster can be estimated asN Adaptive change of the standard deviation σ W of the kernel related to the estimated valueσ N of the standard deviation is sufficient for each update because the mean µ N of the major cluster and the standard deviation σ N can also be estimated. Specifically, the standard deviation σ W of the kernel is assigned to be r times the estimated valueσ N , although it depends on the existence range of outliers. We designate this r as a scale factor. Regarding appropriate r, we will examine this point in a numerical experiment discussed later.

Mean-Shift with Updating Kernel
Based on the discussion presented in Section 3.1, at each iteration of the general mean-shift method, the standard deviation σ N is estimated simultaneously in addition to the mean value µ N . Therefore, we propose a new mean-shift method that adaptively changes the standard deviation σ W of the kernel. The algorithm is summarized as presented below: 1. Let the mean µ x of sample x n , n = 1, . . . , N be the initial value of the mean estimatorμ N of the major cluster and let standard deviation σ x of this sample be the initial value of the standard deviation estimator σ N of the major cluster aŝ 2. Consider a Gaussian distribution p(x; µ W , σ W ) with mean µ W and standard deviation σ W as the kernel function in the value direction. Here, the mean µ W and the standard deviation σ W are given respectively by the estimated valueμ N of the mean and the estimated valueσ N of the standard deviation of the major cluster: Here, mean µ W and variance σ W of the Gaussian kernel are not estimators, although they change when the kernel updates. 3. Weight a n , n = 1, . . . , N for each sample x n , n = 1, . . . , N weighted by such a Gaussian kernel p(x; µ W , σ W ) is calculated using Equations (3) and (4). We use this weight a n to calculate the sample mean µ x and standard deviation σ x with x n , n = 1, . . . , N as shown below: 4. The values of mean estimatorμ N , standard deviation estimatorσ N , and number of samples estimatorN N of the sample are updated, respectively, by the following equations: 5. If the variations of the values of these estimators are equal to or less than the predetermined fixed value, then the update process is terminated. Otherwise, return to 2 and repeat the iteration.

Update Process of Mean-Shift with an Updatable Kernel
For the proposed method, we use iteration to confirm the process by which the estimated values of the mean vector, the covariance matrix, and the number of samples converge to true values of the major cluster. Although no restriction is made of the dimension of data to which the proposed method is applicable, to illustrate and explain the distribution of data and update process, two-dimensional data are targeted for analysis. Herein, we obtain the major cluster with N N = 3000 points generated in two-dimensional normal distribution with the mean vector µ N = (0, 0) T and variance covariance matrix as The outliers with N O = 200 points are distributed uniformly within the range of Figure 2 shows an example of the generated sample in (x 1 , x 2 ) space. Symbol • in the figure represents the coordinates of each point. The points spreading in the central elliptical shape belong to the major cluster. Other points distributed in a square shape on the upper left are outliers. In the figure, the solid ellipse represents a contour line where 99% of the M-dimensional normal distribution defined by the mean vector µ N and the covariance matrix C N fall within it. Later, we present the mean vector µ N and covariance matrix C N , or their estimates. In general, as discussed in Appendix C.2, the initial estimated value of the meanμ N and covariance matrixĈ N for major cluster can be assigned respectively to the mean and covariance matrix of all samples. However, we set the initial kernel having mean vectorμ N = (−2, 3) T and covariance matrix intentionally to be located and shaped sufficiently apart from the major cluster. To demonstrate how the estimated value converges to the true value with updating, the scale factor is r = 1.0. The update ends when it satisfies all conditions for which the sum of squares of the change amountμ N is 0.01 or fewer, the sum of squares of the change amount ofĈ N is 0.01 or fewer, and the square of the change amount ofN N is 30 or less. As described earlier, the solid ellipse shown in Figure 3 represents the estimated value of mean vectorμ N , covariance matrixĈ N , and numberN N of samples for each update in the proposed method. In Figure 3, the estimated valuesμ N ,Ĉ N ,N N are shown to converge to the true values µ N , C N , N N corresponding to Figure 2 as the update progresses, although they start from more or less bad initial values. Here, for the estimated valueN N , we have accuracy to one decimal place.

Influence of Kernel Bandwidth on Estimation Accuracy (Unbiasedness)
An exceedingly important property required for estimators is unbiasedness: a property by which the expected value of the estimated value coincides with the true value. If no statistical bias in the estimated value exists, then it represents that the estimation is accurate. Assuming that the parameter is θ, we investigate the unbiasedness of the estimatorθ. If parameter θ is a scalar, then the bias error is the difference E[θ] − θ between the expected value and the true value θ of the estimator. Otherwise, if parameter θ is a vector or matrix, then the bias error is the square root E[θ] − θ 2 of the sum of squares over all the elements. It can be evaluated whether the bias error is zero. As explained below, it demonstrates that the initial value of the kernel bandwidth has less influence on the unbiasedness of the estimated value in the proposed method discussed in Appendix C than in the general mean-shift method introduced in Appendix A.
The distributions that major cluster and outliers follow, the numbers of samples N N , N O , scale factor r, and update ending condition are the same as those described in Section 4.1. The initial estimated value of mean vectorμ N is the mean vector of all samples. The initial estimated value of covariance matrix is assigned toĈ N = σ 2 I. In the general mean-shift method, the covariance matrix of the kernel is C W = σ 2 I. Under the conditions presented above, the mean vector µ N , the covariance matrix C N , and the number N N of samples are estimated using the general mean-shift method and the proposed method. In addition, because it is impossible to obtain the expected value in numerical experiments, the expected value is replaced by the average value of the estimated values for 1000 trials that change the random number.
In the proposed method, σ 2 is the initial value of the kernel bandwidth. It corresponds to the pre-set value of the kernel bandwidth in a general mean-shift method. When this σ 2 is changed to various values, the bias errors of the estimated value of the mean vector µ N , covariance matrix C N , and number N N of samples are calculated. Results are presented respectively in Figure 4a-c.
The horizontal axis shows the selection of different kernel bandwidth. The vertical axes respectively show the bias errors for estimators µ N , C N , and N N . In this figure, symbol corresponds to the proposed method. The symbol △ represents the bias errors in a general mean-shift method. However, because the covariance matrix and number of samples cannot be estimated in a general mean-shift method, only the results obtained using the proposed method are shown in Figure 4b,c. The scale on the vertical axis of the figures is fixed to represent 10% of errors at full scale. In the following figures, the same scale applied to these figures will be used unless specified otherwise.   Figure 4a shows that the bias error increases linearly and that the unbiasedness is lost when the kernel bandwidth σ 2 approximately exceeds the range of 0.5-1.5 represented by symbol ↓ in a general mean-shift method because, as the kernel becomes larger, the outliers fall within the range of the kernel, which greatly affects the mean estimation of the major cluster. For this reason, the proper set of the kernel bandwidth is an important difficulty in a general mean-shift method. However, the kernel bandwidth is adjusted according to the estimated value of covariance matrix of a major cluster at each iteration in the proposed method. Therefore, it is less susceptible to the influence of initial value σ 2 . Furthermore, in Figure 4b,c, it is the same situation in the estimations of covariance matrix C N and number N N of samples.
While maintaining the ratio of the number N N of samples of major cluster and the number N O of samples of the outliers to 3000:200 and changing the number N = N N + N O of samples from 1000 to 90,000, the variance of each estimate value of the mean vector µ N , covariance matrix C N , and number N N of samples are obtained using our proposed method, as shown in Figure 5. The horizontal axis shows the selection of different numbers of samples corresponding to the whole samples. The vertical axes respectively represent the bias errors for estimators µ N , C N , and N N . Because the proposed method is independent of the initial value σ 2 , the initial value σ 2 is fixed to 1.5. Figure 5 shows that these estimators are unbiased for a finite number of samples.

Influence of the Scale Factor r Value on Estimation Accuracy
In the proposed method, we need not select the initial value of kernel bandwidth in advance because the kernel bandwidth is changed adaptively. The pre-set of the initial value shows some difficulty in influencing the estimation accuracy. Instead, the problem of optimal setting of the scale factor r occurred. Scale factor r represents the ratio of the kernel bandwidth (standard deviation) to the major cluster width (standard deviation). Therefore, the smaller the scale factor, the smaller the kernel bandwidth (standard deviation) is set with respect to the major cluster width (standard deviation). From the viewpoint of estimation accuracy, the kernel bandwidth (standard deviation) should be sufficiently large but not cover the outliers. In other words, if the outliers exist at the distance from the mode of major clusters more than three times the standard deviation of the major cluster, according to three-sigma rule of thumb, the kernel bandwidth should be the same as the standard deviation of major cluster, which means r = 1. Otherwise, if there are a certain number of outliers within a standard deviation away from the mode of the major cluster, the kernel bandwidth is expected to be 1/3 of the standard deviation of the major cluster, which means r = 1 3. If the distribution of the major cluster and the outliers is specified completely, then it is possible to derive the theoretical formula of the optimal scale factor r as a parameter. However, because the purpose is to estimate the distribution of the major cluster and the outliers, then, even if a theoretical formula for scale factor r is derived, it cannot be used for estimation. Derivation of the theoretical formula for scale factor r has no great value. Therefore, as described below, we investigate the influence of the selected value of this scale factor on the estimation accuracy.
The distributions that major cluster and outliers follow, number N N , N O of samples, and update ending condition are the same as those in Section 4.1. As shown in Section 4.2, the initial values of the estimated value of mean vector and covariance matrixμ N ,Ĉ N are given, respectively, by the mean vector and covariance matrix of the whole samples. We select scale factor r to be various values and estimate the mean vector µ N , covariance matrix C N , and number N N of samples using the proposed method. The bias errors of each estimated value is presented in Figure 6a,c. The horizontal axis represents the selection of various scale factors r. The vertical axes respectively represent the bias errors for estimators µ N , C N , and N N .  Figure 6 shows that the bias errors of any estimate increases and that the unbiasedness is lost when scale factor r is selected as a value larger than a certain value because, when the kernel bandwidth increases, it becomes more susceptible to outliers, as with the general mean-shift method shown in Figure 6. However, when scale factor r is selected as a small value, the bias error is increased extremely. The unbiasedness is lost relative to the covariance matrix C N and number N N of samples, although it is not readily apparent on mean vector µ N . The reason for this is explainable as presented below.
If we select scale factor r as a value smaller than one, the kernel bandwidth becomes small because of a lack of the practical number of samples that contribute to the estimation. For that reason, the estimation precisions of mean vector µ x and standard deviation σ x,m are deteriorated. The deterioration of this estimation accuracy results from the small number of samples. Consequently, the estimated error has normality, but does not include bias error. As shown in Equation (A32), the estimated valueμ N of the mean vector is the sample mean vector µ x . The estimation equation of the standard deviationσ N and numberN N of samples is a nonlinear function of the sample standard deviation σ x,m , as shown in Equations (A19) and (A20). In general, normality is lost by a nonlinear transformation. Therefore, the estimation errors of both the standard deviationσ N and the numberN N of samples are converted to the bias errors by the nonlinear transformations, even if the estimation error of the sample standard deviation σ x,m had normality. Figure 6 shows that the appropriate value of the scale factor r is in the range of 0.5 ≤ r ≤ 1.5, but it depends on the characteristic of the target data. For example, the lower limit increases when the number of samples is small. The upper limit decreases when the outliers approach a major cluster. Comparing the bias error with the general mean-shift indicates that the selection of scale factor r need not be the same as the situation of kernel bandwidth as shown in Figure 4 because the range in which the bias error can be kept low is wide.

Verification of Consistency
The goodness of the estimator is evaluated by accuracy and precision. Accuracy is evaluated as the bias error, as discussed in Section 4.2, whereas the precision is evaluated by the variance of estimated values. Before comparing the estimation precision of a general mean-shift method with the proposed method, one must confirm the consistency of the estimated values in both methods. Consistency is an important property required for the estimator. It indicates the characteristics by which the variance of the estimated values approaches 0 as the number of samples used for estimation increases.
The distributions that major cluster and outliers follow, in addition to the update ending conditions, are the same as those described in Section 4.1. As shown in Appendix C.2, the initial valuesμ N ,Ĉ N of the estimate values of the mean vector and covariance matrix are given respectively by mean vector µ x and covariance matrix C x of the whole samples. To ensure that the estimator is unbiased, we select the scale factor as r = 1.0 based on the discussion of the proposed method in Section 4.3, and the kernel as C W = σ 2 I, σ 2 = 1.5 based on the discussion for a general mean-shift method in Section 4.     Figure 7a-c are drawn as a logarithmic graph; the slope should be −1 in fact. Therefore, the relation between the sample number of samples and the estimation variance is approximated using a linear polynomial with the slope fixed at −1. The straight line represented by the approximate linear polynomial is superimposed by a solid line in these figures. These results demonstrate the validity of the approximation. Here, we simply define the estimation variance as the 0-order coefficient of the approximate linear polynomial or the virtual estimation variance corresponding to sample number N = 1. Regarding to the estimation variance, we compare the estimation precision of proposed method with the general mean-shift method.

Estimation Precisions of the Proposed and General Mean-Shift Methods
The distributions that major cluster and outliers follow, the number N N , N O of samples, and the update ending condition are the same as those described in Section 4.1. As shown in Appendix C.2, the initial values of the estimated value of mean vector and covariance matrixμ N ,Ĉ N are given, respectively, by the mean vector µ x and covariance matrix C x of the whole samples.
We select scale factor r to be various values and use the proposed method to estimate the mean vector µ N , covariance matrix C N , and number N N of samples. Similarly, letting the covariance matrix of kernel be C W = σ 2 I, we estimate the mean vector µ N using the general mean-shift method while the kernel bandwidth σ 2 is changed to various values. The estimation variance of estimated valueμ N is presented in Figure 9.
From Figure 8a-c, the estimation variance of each estimated value of the mean vector µ N , covariance matrix C N , and number N N of samples decreases with respect to r, monotonically. If r is small, then the kernel bandwidth decreases. The number of substantial points involved in the estimation decreases. Therefore, the estimation precision deteriorates. On one hand, if r is large, then the estimation precision decreases. Because bias error occurs as shown in Figure 6, it is not desirable as an estimator. However, the estimation variance related to general mean-shift method decreases monotonically with respect to kernel bandwidth σ 2 , as shown in Figure 9. The reason is exactly the same as in the case of the proposed method.
Finally, the estimation precision of a general mean-shift method and that of the proposed method are compared. Regarding the general mean-shift method, the estimation is unbiased if σ 2 ≤ 1.5, as shown in Figure 4. However, the estimation precision increases as σ 2 becomes larger, as shown in Figure 9. In the general mean-shift method, the optimal selected value of the kernel bandwidth is σ 2 = 1.5. The estimation variance at kernel bandwidth σ 2 = 1.5 is read from Figure 9: its value is shown by a horizontal dotted line in Figure 8a. In the proposed method, 0.5 ≤ r ≤ 1.5 is the suitable range of the scale factor r. In this range, the estimation variance of the proposed method is half or less than half of that of the general mean-shift method. The proposed method has higher estimation precision than the general mean-shift method that has the optimal kernel bandwidth for the following reason. In the general mean-shift method, the kernel shape is expressed as an isotopic shape because the covariance matrix of the kernel is represented as a diagonal matrix in which all diagonal elements are equal. Otherwise, in the proposed method, the kernel shape can take an arbitrary anisotropic shape because the covariance matrix of the kernel can take an arbitrary matrix that satisfies the condition as a covariance matrix. The practical number of samples that contribute to the estimation can be maximized by adjusting the kernel shape to the distribution of samples.

Discussion
Numerical experiments in two-dimensions described in Sections 4.2, 4.3 and 4.4 yield results for the major cluster and outlier model shown in Figure 2. The purpose of our numerical experiment is to confirm whether the estimators (mean, covariance, number of sample of the major cluster) in the proposed method are unbiased and consistent without a proper pre-set of kernel bandwidth. If these estimators are consistent unbiased estimators, then the proposed method can achieve accurate estimation of the mean, covariance, and the number of samples of the major cluster. We chose the two-dimensional numerical experiment to observe the dynamic changes of the kernel more intuitively during the iteration. The iteration process is shown in Figure 3. In the numerical experiments described herein, the major cluster follows the Gaussian distribution. If the proposed method performs well on other distributions, then the scope of application of the proposed method can be expected to expand. We discuss the scope of application of the proposed method in two aspects as presented below.
For a one-dimensional signal processing field, the assumption of normality is not regarded as being such a severe strong assumption. Yokota and Ye [19] proposed the radical root, or r-th root, transform of the power spectrum series that follows the chi-square distribution, such that the transformed series follows a quasi-Gaussian distribution. Lotter and Vary [20] proposed a spectral amplitude estimator with a parametric super-Gaussian speech model for approximating the probability density distribution of the real speech spectral amplitudes. In fact, the parametric super-Gaussian distribution can approximate the Rayleigh-Laplace-Gamma distribution or other distributions exactly. Ye and Yokota [21] applied the radical root transformation to the super-Gaussian distributions. Thereby, they confirmed that the super-Gaussian distribution after r-th radical root transformation can be quasi-Gaussian distributed. By radical root transformation [21], the proposed method is applicable for major clusters that follow different distributions other than a Gaussian distribution. However, for clustering in image processing or other multiple dimensional applications, the major cluster following a Gaussian distribution is truly a strong assumption.
In addition to the problem addressed in this paper, many methods exist to solve this problem other than the mean-shift method. They have been discussed as described below. Under the normality assumption, Grubbs' test [22][23][24] and Thompson Tau test [25] are known as methods for testing whether the sample farthest from the sample mean is an outlier. These tests are applied sequentially from the samples that are outermost from the sample mean, but the number of outliers is only valid at most to several. Moreover, applying the tests to multi-dimensional data are not easy. If the outliers follow a Gaussian distribution and if the number of clusters in which the outliers are distributed is known, then, by applying a Gaussian mixture model [26][27][28], the mean and covariance matrix of major cluster can be estimated easily using the Expectation-maximization(EM) algorithm [29,30]. However, such an assumption cannot be applied generally to the outliers.
In fact, selection of the kernel bandwidth is an important issue that strongly affects the result of the general mean-shift algorithm compared to setting of the kernel type. Therefore, we only used the Gaussian kernel to make a presumption here. However, there are many commonly used kernel functions in addition to the Gaussian kernel, such as the Epanechnikov Kernel, the Uniform Kernel, the Quartic Kernel, and the Triweight Kernel. Application of it to other kernel functions according to the derivation of this article will undoubtedly make this research more comprehensive and general. Such application is expected to be an important part of our future research.

Application
Considering a stochastic process x(t), the short-time Fourier spectrum centering on time t with a suitable window length is denoted as X(t, f ). Here, f represents the frequency. Let X f (t) ≡ X(t, f ) be denoted as the spectrum series if frequency f is fixed. By applying the non-steady-state analysis of the stochastic process, the spectrogram P(t, f ) = X(t, f ) 2 denotes the power of the short-time Fourier spectrum X(t, f ). Because the frequency f is fixed, P f (t) will be designated as the power spectrum series.
Yokota and Ye [19] proposed a power spectrum estimation method robust for sudden noise. The method uses the radical root transformation to quasi-Gaussian distribution. The following concludes the process of the noise estimation algorithm proposed by Yokota and Ye [19]: (1) Obtainpower spectrogram P(t, f ) from the noisy signal. We chose a pulse code modulation(PCM) recording of a noisy signal that contains a certain amount of sudden noise for analysis and computes the spectrogram with a Hamming window length of 10 ms achieving a 50% overlap between adjacent frames by short-time Fourier transformation. Figure 10a presents an example of a noisy signal for analysis and the corresponding spectrogram.
(2) Perform the following process for each frequency f : (2-1) Use the radical root transformation in the power spectrum series P f (t) with the transformation parameter r * = 3.314. Thereby, obtain the new quasi-Gaussian distributed power spectrum series P 1 r * f (t). Figure 10b portrays a histogram of the power spectrum series at f = 512 Hz before the transformation.
(2-2) Compute the mode value of transformed power spectrum series P 1 r * f (t) by kernel density estimation [5]. Then, put the mode value as the corresponding time average value P noise ( f ) of the noise power spectrum series. Figure 10c depicts a histogram of the transformed power spectrum series at f = 512 Hz, the kernel density estimation [5] with proper kernel bandwidth and the major cluster estimation using our proposed method.
(2-3) Compute the time average value P( f ) of the noise power spectrum series from the time average value P noise ( f ) as (3) Obtain P( f ) as an estimation of the noise power spectral density. In the noise estimation algorithm [19], the mode estimation accuracy directly affects the noise estimation result. As Figure 10c shows, kernel density estimation [5] can be replaced by our proposed method for comparison. The proper pre-setting of kernel bandwidth is also important in kernel density estimation [5]. It exhibits a strong influence on the resulting estimate similarly to the general mean-shift method. To illustrate its effects, we obtained the noise power spectrum series from the PCM recording, which is shown in Figure 10a, for analysis. Figure 11 portrays the relation between the kernel bandwidth and kernel density estimation. The histogram shows the true density. The broken curve is under-smoothed because it includes too many spurious data artifacts arising from use of 0.000001 bandwidth, which is too small. The dotted curve is over-smoothed because using 0.0001 bandwidth obscures much of the underlying structure. The solid curve with 0.00003 bandwidth is regarded as optimally smoothed because its density estimate is close to the true density. To assess the performance of the proposed method, general mean-shift method, and kernel density estimation for a noise estimation algorithm [19], this study uses PCM recordings of air-conditioning noise with some sudden noise, as shown in Figure 10a and without sudden noise, respectively, as test data and the true value. Noisy signal data in PCM recordings are not compressed. They have no power consumption. Figure 12 presents comparison results for noise estimation using the proposed method and kernel estimation. Here, we preset the kernel bandwidth as 0.0001. As Figure 12 shows, in the case in which an inappropriate kernel bandwidth is set in advance, noise estimation using our proposed method closely approximates the true noise, but the estimation accuracy using the kernel estimation is not high. True noise Estimated noise using proposed method Estimated noise using kernel density estimation Figure 12. Comparison of the proposed method to kernel estimation for noise estimation.

Conclusions
The study described in this paper has addressed the problem of proper pre-setting for the fixed search kernel in a general mean-shift method. To improve the estimation accuracy, a new mean-shift method was proposed in which the mean vector and covariance matrix of the major cluster are estimated at each iteration. Then, the kernel bandwidth and shape are adjusted corresponding to the estimates. In numerical experiments, we compared the estimation accuracy and precision of the proposed method and of the general mean-shift method. The experimentally obtained results demonstrate that the estimation accuracy and precision of the proposed mean-shift are higher than those of a general mean-shift method. Moreover, the proposed mean-shift can estimate the covariance matrix and the number of samples of major clusters effectively and correctly. Neither can be estimated using the general mean-shift method. These results were confirmed through formal experimentation, the results of which indicated the superior performance of our method compared to that of the general mean-shift method.

Conflicts of Interest:
The authors declare that they have no conflict of interest related to this report or to the study it describes.

Appendix A. General Mean-Shift for a Multi-Dimensional Situation
Even when the target for data are multi-dimensional, it is fundamentally the same as the one-dimensional data. Sample x n , n = 1, . . . , N of the M-dimensional column vector includes the major cluster of N N points and a few outliers. The major cluster follows an M-dimensional Gaussian distribution with mean vector µ N and covariance matrix C N . Here, the mode of the major cluster is not biased from the mean vector µ N under the influence of N O point outliers. The iteration process in the multi-dimensional mean-shift method is the following: 1. Let the mean vector µ x of sample x n , n = 1, . . . , N be the initial value of the mean estimatorμ N of the major clusterμ 2. Consider a M-dimensional Gaussian distribution p(x; µ W , C W ) with mean vector µ W and covariance matrix C N as the kernel function in value direction. Here, the mean mean vector µ W of kernel function is ascertained by the mean estimator of major cluster In addition, covariance matrix C N is assigned to be an appropriate size as discussed in Section 2.2. 3. The weight a n , n = 1, . . . , N for each sample x n , n = 1, . . . , N weighted by such a Gaussian kernel is a n = 1 A p(x n ; µ W , C W ). However, We use this weight a n to calculate the sample mean vector µ x with x n , n = 1, . . . , N as 4. The value of mean vector estimatorμ N for the major cluster is updated using the following equation:μ 5. If the value variation of mean vector estimatorμ N is equal to or less than the predetermined fixed value, the update process is terminated. Otherwise, return to 2 and repeat the iteration.

Appendix B. Proof of Equation (17)
Equation (16) can be rewritten as p(x n ; C W )x 2 n are both random variables.
Obviously, if the standard deviation of the denominator is sufficiently small compared to the expected value of the denominator, Equation (16) can be approximated as shown below because the denominator can be regarded as a simple variable rather than a random variable as shown in Equation (17). Hereafter, it is proved that the standard deviation can be as small as possible with respect to the expected value of the denominator when the number of samples N N → ∞.
Proof. The denominator on the right side of Equation (A7) has the form of The expected value E(y) and the standard deviation σ(y) are If the number N N of samples is sufficiently large, which means N N → ∞, σ(y) for E(y) converges to 0. p(x n ; C W ) is non-negative because it is a probability density distribution. That is, since the random variable x n follows the probability density distribution f (x) defined by x ≥ 0, the expected value E[x n ] of x n is always positive. Regardless of the number of samples N N , it becomes E[y] = E[x n ], so that, with the number of samples N N → ∞, the denominator can reduce the standard deviation as much as possible relative to the expected value.
The expected values and standard deviations for various probability density distributions f (x) defined by x ≥ 0 are presented in Table A1. The table shows that, for all probability density distributions shown in this table, the standard deviation σ(x n ) does not become larger than the expected value E(x n ) beyond the order. The same is probably true for other probability density distributions not listed in this table. Therefore, corresponding to the number of samples N N = 100, the standard deviation σ(y) can be about one-tenth of the expected value E(y). Practically speaking, Equation (A8), i.e., the approximation of Equation (A7), holds. Here, we extend derivation of the estimated value for standard deviation σ N in the one-dimensional derived in Section 3.1 to multi-dimensional. The major cluster is assumed to follow a multi-dimensional (M-dimensional) normal distribution. Although the covariance matrix generally does not become a diagonal matrix, it is possible to re-coordinate the coordinate axes so that the covariance matrix becomes a diagonal matrix by appropriate orthogonal transformation. Furthermore, the coordinate axes are shifted such that the mean vector becomes a zero vector. In this section, we consider the variable (x 1 , . . . , x M ) in such a transformed coordinate system. We let the variables be x = (x 1 , . . . , x M ) T and denote the standard deviation of each variable by σ N = (σ N,1 , . . . , σ N,M ) T . On the newly revised coordinate axes, because the covariance is zero, a M-dimensional normal distribution is represented as a direct product of the one-dimensional normal distribution of each variable as The kernel function in the value direction is also assumed to be a Gaussian distribution with a mean zero vector and a diagonal covariance matrix. Because the standard deviation of each variable is σ W = (σ W,1 , . . . , σ W,M ) T , the Gaussian distribution of kernel function is Using this Gaussian kernel, the weight a n , n = 1, . . . , N N for the sample x n = (x 1,n , . . . , x M,n ) T , n = 1, . . . , N N can be denoted as However, A in the above equation is The sample variance σ 2 x,m weighted by a n is σ 2 x,m = N N n=1 a n x 2 m,n , m = 1, . . . , M.
For the same reason, under the one-dimensional case, by substituting Equation (A14) into Equation (A16), the expected value of the sample variance σ 2 x,m can be approximated as Applying Equation (A12) to Equation (A15) and using Equation (13), the expected value of A is found as while using Equation (14), the remainder of Equation (A17) is when using the standard deviation σ x,m of the sample weighted with a Gaussian kernel with standard deviation σ W . Furthermore, using Equation (A18), we can estimate the number N N of samples belonging to a major cluster asN The standard deviation σ W of the Gaussian kernel is assigned adaptively as r times the estimated valueσ N of the standard deviation at each iteration. The appropriate value of the scale factor r is discussed later in relation to a numerical experiment.
Appendix C.2. Mean-Shift Method with Updating Kernel 1. The mean vector µ x and the covariance matrix C x of the whole samples are determined using the following equations: The initial values of the mean vectorμ N and the covariance matrixĈ N of the major cluster are assigned asμ 2. One can consider a multi-dimensional Gaussian distribution p(x; µ W , C W ) with mean vector µ W and covariance matrix C W as the kernel function in the value direction. Here, the mean vector µ W and covariance matrix C W of the kernel function are determined as Actually, r 2 in the above equation is derived from the fact that the covariance matrix has the squared order of the standard deviation. 3. Weight a n for each sample x n weighted by such a Gaussian kernel is calculated using Equations (A3) and (A4). The mean vector µ x and the covariance matrix C x are determined using the following equations: 4. The value of mean vector estimatorμ N is updated using the following equation: Let be an eigenvalue decomposition of the covariance matrix C W , which can be represented as a symmetric matrix of the kernel. The diagonal elements of the diagonalized matrix Λ W are eigenvalues of C W ; they represent the variances σ 2 W,1 , . . . , σ 2 W,M along the directions represented by each of the column vectors of orthogonal matrix V W . In addition, the diagonal element of is the variance σ 2 x,1 , . . . , σ 2 x,M of V W in the column vector direction in the sample covariance matrix C x . According to Equation (A20), we can estimate the number N N of samples belonging to the major cluster by the standard deviation σ N,1 , . . . , σ N,M , which is obtained by σ 2 W,1 , . . . , σ 2 W,M and σ 2 x,1 , . . . , σ 2 x,M in Equation (A19). LetΛ N be the diagonal matrix that has the estimated σ N,1 , . . . ,σ N,M as the diagonal elements. UsingΛ N , the covariance matrixĈ N is updated with the following equation:Ĉ The estimated valueN N of the number of samples belonging to a major cluster is updated using the following equation:N 5. If the value variations ofμ N ,Ĉ N ,N N are equal to or less than the predetermined fixed value, then the update process is terminated. Otherwise, return to 2 and repeat the iteration.