An Improved Variable Kernel Density Estimator Based on L2 Regularization

Abstract: The goal of a kernel density estimator (KDE) is to find the underlying probability density function (p.d.f.) of a given dataset. The key to training a KDE is to determine the optimal bandwidth, or Parzen window. In the fixed KDE (FKDE), all data points share a fixed bandwidth (a scalar for univariate KDE and a vector for multivariate KDE). In this paper, we propose an improved variable KDE (IVKDE) which determines the optimal bandwidth for each data point in the given dataset based on the integrated squared error (ISE) criterion with an L2 regularization term. An effective optimization algorithm is developed to solve the improved objective function. We compare the estimation performance of IVKDE with that of FKDE and of VKDE based on the ISE criterion without L2 regularization on four univariate and four multivariate probability distributions. The experimental results show that IVKDE obtains lower estimation errors, which demonstrates its effectiveness.


Introduction
It is very important for many machine learning algorithms to estimate the unknown probability density functions (p.d.f.s) of given datasets, e.g., Bayesian classifiers [1,2], density-based clustering algorithms [3,4], and mutual information-based feature selection algorithms [5,6]. In order to obtain the unknown p.d.f., an effective kernel density estimator (KDE) should be carefully constructed in advance. The classical KDE training method is the Parzen window method [7], which uses a superposition of multiple kernel functions with a fixed Parzen window (i.e., bandwidth) to fit the unknown p.d.f. The most commonly used kernels [8] include the uniform, triangular, Epanechnikov, biweight, triweight, cosine, and Gaussian kernels. Compared with the choice of kernel, the bandwidth plays a more important role in p.d.f. estimation: a large bandwidth results in an over-smoothed estimate, while a small bandwidth leads to an under-smoothed one.
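As a concrete illustration, the Parzen window estimate with a Gaussian kernel and a fixed bandwidth can be sketched as follows. This is a minimal univariate example of our own; the function names and sample values are illustrative, not taken from the paper.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fkde(x, data, h):
    """Fixed-bandwidth Parzen window estimate:
    f_hat(x) = (1 / (N * h)) * sum_n K((x - x_n) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xn) / h) for xn in data) / (n * h)

# Estimate the density of a small sample at x = 0.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(fkde(0.0, sample, h=0.5))
```

Shrinking `h` makes the estimate spikier around each data point (under-smoothing), while enlarging it flattens the estimate toward a single broad bump (over-smoothing), which is exactly the trade-off described above.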
How to determine an optimal bandwidth is a key point in training a KDE. In order to select an appropriate bandwidth, an effective error criterion must first be designed [9]. Commonly used error criteria include the integrated squared error (ISE) and the mean integrated squared error (MISE). Currently, there are two main ways to design a KDE, i.e., the classical Parzen window method with a fixed bandwidth parameter, named the fixed kernel density estimator (FKDE), and the modified Parzen window method with variable bandwidth parameters, named the variable kernel density estimator (VKDE). The representative studies on FKDE and VKDE are summarized as follows.

•
Fixed kernel density estimator. The rule-of-thumb-based KDE (RoT-KDE) [10] was designed based on the asymptotic MISE (AMISE) criterion by assuming the unknown p.d.f. to be a normal p.d.f. Due to this inappropriate assumption about the true p.d.f., RoT-KDE is a naive KDE and is inclined to select an over-smoothed bandwidth [8].
Apart from the simple and direct RoT-KDE, there are three other sophisticated KDEs, i.e., the bootstrap-based KDE (BS-KDE) [11], the biased cross-validation-based KDE (BCV-KDE) [12], and the unbiased cross-validation-based KDE (UCV-KDE) [13]. BS-KDE determines the optimal bandwidth based on the MISE criterion by using the bootstrap technique to estimate the true p.d.f. BCV-KDE was also designed based on the MISE criterion and calculates the optimal bandwidth by establishing a relationship between the true p.d.f. and its estimate.
•
Variable kernel density estimator. The VKDE was first proposed in [14], which introduced variable bandwidths for each data point in the given dataset and represented each bandwidth by the distance from the data point to its k-th nearest neighbor. Jones [15] clarified the difference between a VKDE employing a different bandwidth for each data point and a VKDE whose bandwidth is a function of the estimation location. Terrell and Scott [16] derived an optimization rule for variable bandwidths based on the asymptotic mean squared error (AMSE) criterion. Hall et al. [17] improved the VKDE proposed in [16] by further analyzing its rates of convergence. Wu et al. [18] proposed a strategy that expresses the variable bandwidth in VKDE as the product of a local bandwidth factor and a global smoothing parameter. Suaray [19] proposed a VKDE for p.d.f. estimation on censored data. Klebanov [20] proposed an axiomatic approach to constructing a VKDE which guarantees that the density estimate is invariant under linear transformations of the original density as well as under splitting of the density into several well-separated parts.
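For reference, the rule-of-thumb bandwidth mentioned above is commonly written in the normal-reference form h = 1.06 · σ̂ · N^(−1/5) (Silverman's rule). The following quick sketch is our own illustration of that formula, not code from the paper:

```python
import statistics

def rule_of_thumb_bandwidth(data):
    """Normal-reference rule of thumb: h = 1.06 * sigma_hat * N^(-1/5).
    It assumes the unknown p.d.f. is roughly normal, which is exactly the
    assumption that makes RoT-KDE over-smooth multimodal densities."""
    sigma = statistics.stdev(data)
    return 1.06 * sigma * len(data) ** (-0.2)

print(rule_of_thumb_bandwidth([0.1, 0.5, -0.3, 1.2, 0.8, -0.7, 0.2, 0.4]))
```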
Compared with FKDEs, the main merit of VKDEs is that the variable bandwidths can flexibly adjust the importance of data points during p.d.f. estimation. This paper focuses on improving the VKDE. Jones [21] discussed the roles of the ISE and MISE criteria in p.d.f. estimation. We consider using the ISE criterion to calculate the optimal bandwidths for the VKDE. Mathematical analysis indicates that the ISE criterion usually leads to an over-smoothed p.d.f. estimate. Inspired by the combination of empirical and structural risks, in this paper we propose an improved variable KDE (IVKDE) which determines the optimal bandwidth for each data point based on the ISE criterion with an L2 regularization term. The ISE and L2 regularization terms represent the empirical and structural risks of constructing the VKDE, respectively. In order to obtain the optimal variable bandwidths, an effective optimization scheme is developed to solve the improved objective function. We conduct exhaustive experiments to validate the rationality, feasibility, and effectiveness of IVKDE. The experimental results show that IVKDE is convergent and able to obtain a desirable p.d.f. estimate. In comparison with FKDE and with VKDE based on the ISE criterion without L2 regularization on four univariate and four multivariate probability distributions, IVKDE obtains lower estimation errors, which demonstrates its effectiveness.
The remainder of this paper is organized as follows. In Section 2, we describe the basic principles of the variable kernel density estimator. In Section 3, we introduce the improved variable kernel density estimator. In Section 4, we provide experimental results and analysis. Finally, in Section 5, we conclude this paper and discuss future works.

Proposed IVKDE
In this section, we first present the improved VKDE, which uses an L2 regularization term-based objective function to evaluate the quality of the variable bandwidths. Then, a bandwidth optimization algorithm is developed to solve for the optimal variable bandwidths based on this objective function.
The purpose of VKDE training is to make the estimated p.d.f. f̂_VKDE(x) as close to the true p.d.f. f(x) as possible. From Equation (6), we can see that the performance of VKDE depends only on the selection of the bandwidth vectors. We want to select the bandwidth vectors which minimize the error between f̂_VKDE(x) and f(x). In order to measure the estimation error, an effective error criterion should first be designed. The integrated squared error (ISE) in Equation (7) is used in our proposed IVKDE to measure the estimation error.
From Equation (7), we can see that the third term ∫ f^2(x) dx is unrelated to the unknown bandwidth vectors. Thus, the optimal variable bandwidth vectors can be obtained by minimizing the simplified ISE criterion in Equation (8). Equation (8) is a data-driven error measurement which easily leads to a data-adaptive KDE and makes the estimated p.d.f. more inclined to fit the given dataset X. In order to guarantee the good generalization capability of the KDE, we give the following objective function to select the bandwidth vectors for our proposed IVKDE: where the second term is the L2 regularization term, ||h_n||_2 is the L2 norm of the bandwidth vector h_n, n = 1, 2, ..., N, and ξ > 0 is the regularization factor.
Substituting Equation (6) into these expressions, respectively, we obtain the corresponding forms, where f̂_VKDE,−n(x_n), n = 1, 2, ..., N, is a leave-one-out estimator trained through the unbiased cross-validation (UCV) method. IVKDE needs the optimal bandwidth vectors that minimize the objective function with the L2 regularization term. In order to solve for the optimal bandwidths, we first calculate the partial derivative ∆h_nd of L(h_1, h_2, ..., h_N) with respect to h_nd, n = 1, 2, ..., N, d = 1, 2, ..., D. We find that it is very difficult to calculate an analytic solution of h_nd from ∆h_nd = 0. Therefore, we design Algorithm 1, which uses the gradient descent method to solve for the optimal bandwidths of IVKDE based on the objective function in Equation (9). Algorithm 1 iteratively determines the optimal bandwidths with a decaying learning rate. Because L(h_1, h_2, ..., h_N) must be minimized, the negative gradient direction is used in Algorithm 1.
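Since the analytic solution of ∆h_nd = 0 is intractable, the optimization falls back on gradient descent. The following univariate sketch mimics that idea with a numerical gradient and a linearly decaying learning rate; the objective combines the simplified ISE (using the closed form of ∫ f̂² dx for Gaussian kernels and a leave-one-out cross term) with the L2 penalty. All names, parameter values, and the numerical-gradient shortcut are our assumptions, not the paper's exact Algorithm 1.

```python
import math

def norm_pdf(x, var):
    """Normal density with mean 0 and variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def objective(data, h, xi):
    """Regularized simplified-ISE objective (univariate sketch):
    L = int f_hat^2 dx - (2/N) * sum_n f_hat_{-n}(x_n) + xi * sum_n h_n^2."""
    n = len(data)
    # Closed form of int f_hat^2 dx for Gaussian kernels with per-point bandwidths.
    quad = sum(norm_pdf(data[i] - data[j], h[i] ** 2 + h[j] ** 2)
               for i in range(n) for j in range(n)) / n ** 2
    # Leave-one-out cross term (UCV style).
    loo = sum(sum(norm_pdf(data[i] - data[j], h[j] ** 2)
                  for j in range(n) if j != i) / (n - 1)
              for i in range(n))
    return quad - 2.0 * loo / n + xi * sum(v * v for v in h)

def optimize_bandwidths(data, xi=0.01, t_max=150, a_max=0.3, a_min=0.01, eps=1e-5):
    """Gradient descent with a linearly decaying learning rate; the gradient
    is approximated numerically because the analytic form is hard to solve."""
    n = len(data)
    h = [1.0] * n                       # initial bandwidths
    for t in range(t_max):
        alpha = a_max - (a_max - a_min) * t / t_max
        base = objective(data, h, xi)
        grad = []
        for i in range(n):
            hp = h[:]
            hp[i] += eps
            grad.append((objective(data, hp, xi) - base) / eps)
        # Step along the negative gradient, keeping bandwidths positive.
        h = [max(1e-3, h[i] - alpha * grad[i]) for i in range(n)]
    return h

data = [-1.5, -0.6, -0.1, 0.4, 0.9, 1.7]
print(optimize_bandwidths(data))
```

Increasing `xi` shrinks the optimized bandwidths through the h_n² penalty, which is how the regularization counteracts the over-smoothing tendency of the pure ISE criterion.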

Experimental Results and Analysis
We conduct three experiments based on the eight probability distributions shown in Table 1 to validate the rationality, feasibility, and effectiveness of the proposed IVKDE. The graphs of these eight p.d.f.s for the given parameters are presented in Figure 1. Table 1. Four univariate and four multivariate probability distributions (f^(i)(x) in the bimodal, trimodal, and quadrimodal normal distributions is a two-dimensional normal distribution with mean vector µ^(i) and covariance matrix Σ^(i)).

Experimental Setup
The rationality check verifies the convergence of Algorithm 1, the feasibility check shows the estimation capability of IVKDE on the given p.d.f.s, and the effectiveness is demonstrated by comparing the estimation performance of IVKDE with that of FKDE and VKDE. For FKDE and VKDE, the optimal bandwidths are also determined with the gradient descent method. The synthetic datasets obeying the above-mentioned distributions are accessible via our BaiduPan (https://pan.baidu.com/s/1YhkkrckQA_e2GNd8haLE1g, accessed on 25 June 2021) with extraction code vn6j. All the estimators are implemented in the Python programming language and run on a PC with an Intel(R) quad-core 3.00 GHz i5-7400 CPU and 16 GB of memory.

Rationality of IVKDE
We test the convergence of Algorithm 1 on random data points obeying the F, normal, two-dimensional normal, and bimodal normal distributions with the following parameters: In Figure 2, we can see that Algorithm 1 converges for the different regularization factors ξ on the given p.d.f.s. The curves of the bandwidth sums first decrease and then remain stable as the number of iterations increases. This indicates that Algorithm 1 is convergent and can find the optimal bandwidths for IVKDE.
We use Algorithm 1 to determine the optimal bandwidths for each distribution based on the random data points, where the parameters of Algorithm 1 are set as T_Max = 1500,

Effectiveness of IVKDE
On the eight probability distributions shown in Table 1, we compare the p.d.f. estimation performance of IVKDE with that of FKDE and VKDE. The parameters of these three kernel density estimators are summarized in Table 2, where ξ is the regularization factor; T_Max is the maximum number of iterations; α_Max is the maximum value of the learning rate; α_Min is the minimum value of the learning rate; δ is the stopping threshold; and h_nd^(0), n = 1, 2, ..., N, d = 1, 2, ..., D, are the initial bandwidths. The comparative results among FKDE, VKDE, and IVKDE are listed in Table 3. We use the mean absolute error (MAE) to evaluate the training and testing performances of these three kernel density estimators. Assume the true and estimated p.d.f. values for the given dataset X are y_1, y_2, ..., y_N and ŷ_1, ŷ_2, ..., ŷ_N, respectively. Then, the MAE on dataset X is calculated as MAE = (1/N) Σ_{n=1}^{N} |y_n − ŷ_n|.
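The MAE computation itself is straightforward; a minimal sketch with made-up illustrative values (not the paper's data):

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1/N) * sum over n of |y_n - y_hat_n|."""
    assert len(y_true) == len(y_pred), "both lists must have N entries"
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# True vs. estimated p.d.f. values at three sample points (made-up numbers).
print(mean_absolute_error([0.20, 0.35, 0.10], [0.18, 0.40, 0.07]))  # ≈ 0.0333
```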

Conclusions and Future Works
This paper presented an improved variable kernel density estimator (IVKDE) that uses both the integrated squared error (ISE) criterion and L2 regularization to determine the optimal bandwidths. The L2 regularization effectively avoids over-smoothed bandwidth selection. The experimental results demonstrated the rationality, feasibility, and effectiveness of the proposed IVKDE. Future work will proceed along the following research directions: (1) using IVKDE to estimate the unknown p.d.f. of a large-scale dataset [23] and (2) finding practical applications for IVKDE in the data mining and machine learning fields.

Data Availability Statement:
The data presented in this study are available in BaiduPan at https://pan.baidu.com/s/1YhkkrckQA_e2GNd8haLE1g (accessed on 25 June 2021) with extraction code vn6j.

Acknowledgments:
We would like to thank the editors and two anonymous reviewers whose meticulous readings and valuable suggestions helped us to improve this paper significantly after two rounds of review.