A Quantized Kernel Learning Algorithm Using a Minimum Kernel Risk-Sensitive Loss Criterion and Bilateral Gradient Technique

Abstract: Recently, inspired by correntropy, kernel risk-sensitive loss (KRSL) has emerged as a novel nonlinear similarity measure defined in kernel space, which achieves a better computing performance. After applying the KRSL to adaptive filtering, the corresponding minimum kernel risk-sensitive loss (MKRSL) algorithm has been developed accordingly. However, MKRSL, as a traditional kernel adaptive filter (KAF) method, generates a growing radial basis function (RBF) network. In response to that limitation, through the use of the online vector quantization (VQ) technique, this article proposes a novel KAF algorithm, named quantized MKRSL (QMKRSL), to curb the growth of the RBF network structure. Compared with other quantized methods, e.g., quantized kernel least mean square (QKLMS) and quantized kernel maximum correntropy (QKMC), the efficient performance surface makes QMKRSL converge faster and filter more accurately, while maintaining the robustness to outliers. Moreover, considering that QMKRSL using the traditional gradient descent method may fail to make full use of the hidden information between the input and output spaces, we also propose an intensified QMKRSL using a bilateral gradient technique, named QMKRSL_BG, in an effort to further improve filtering accuracy. Short-term chaotic time-series prediction experiments are conducted to demonstrate the satisfactory performance of our algorithms.


Introduction
Online kernel learning (OKL) has become increasingly popular in machine learning because it requires much less memory and a lower computational cost to approximate the desired nonlinearity incrementally [1][2][3][4][5]. As a member of OKL, kernel adaptive filters (KAFs) have attracted much attention because of their advantages, including universal approximation capabilities, reasonable computational complexity, and the fact that they lead to convex optimization problems [6]. By mapping data from an input space into the reproducing kernel Hilbert space (RKHS) through a reproducing kernel [7,8], the nonlinear KAF is obtained in the input space through the linear structure of the RKHS. Recently, the KAF has been widely used in signal processing, such as channel estimation, noise cancellation, and system identification [9][10][11][12][13]. Typical nonlinear adaptive filtering algorithms include the kernel least mean square (KLMS) algorithm [14], the kernel affine projection algorithm [15], the kernel recursive least square (KRLS) algorithm [16], and many others [17][18][19][20][21][22].
The algorithms mentioned above may fail to achieve desirable performance in non-Gaussian noise environments, because they are usually developed on the basis of the mean square error (MSE) criterion, which is a mere second-order statistic and very sensitive to outliers [23]. In the past few years, information theoretic learning (ITL) has been proven to be very efficient and robust in non-Gaussian signal processing [24][25][26]. Different from conventional second-order statistical measures such as the MSE, ITL uses the information theory descriptors of divergence and entropy as nonparametric cost functions for adaptive systems [24]. Through Parzen kernel estimation, ITL can capture higher-order statistics of data and achieves better performance than the MSE, especially in non-Gaussian and nonlinear situations [27]. In consideration of this, ITL also offers an effective and efficient way for robust optimization, which has been a popular methodology in the last decade for addressing optimization problems in the presence of uncertain input data [28][29][30].
In particular, correntropy is a local similarity measure of ITL, defined as a generalized correlation function in kernel space that estimates the similarity between two random variables [31,32], and it has been applied successfully to, e.g., state estimation, image processing, and noise control [33][34][35]. In addition, the corresponding maximum correntropy criterion (MCC) has been used as a cost function to derive various robust nonlinear adaptive filtering algorithms, such as the kernel maximum correntropy (KMC) algorithm [36] and the kernel recursive maximum correntropy (KRMC) algorithm [27], and they show better performance than classical second-order schemes, e.g., KLMS and KRLS. However, correntropic loss (C-Loss) is a non-convex function, and algorithms based on it may converge slowly, especially when the initial value is far away from the optimal value. Recently, a modified similarity measure called kernel risk-sensitive loss (KRSL) was derived in [37], which can be more "convex" and can achieve faster convergence speed and higher filtering accuracy, while maintaining the robustness to outliers. After applying KRSL to adaptive filtering, a new robust KAF algorithm was accordingly proposed [37]. It is called the minimum kernel risk-sensitive loss (MKRSL) algorithm, which could be superior to some existing methods and is attracting growing attention.
One of the limitations of the KAF is the radial basis function (RBF) network structure that grows linearly with each input sample, which increases the memory requirement and computational complexity [17]. Hence, many sparsification methods have been developed to curb this growth, such as the surprise criterion (SC) [38], novelty criterion (NC) [39], coherence criterion [40], and approximate linear dependency (ALD) criterion [16]. Although these methods can significantly curb the network size, the high computing cost still imposes a very challenging obstacle to their practical applications. Moreover, these methods purely discard the redundant data, which may lead to a reduction in filtering accuracy. Then, the online vector quantization (VQ) method was applied to KAFs, constraining their network size through a simple distance calculation. The quantization method is computationally simple, and the redundant data are used to update the coefficient of the closest center, which leads to the improvement of filtering accuracy.
To limit the network size of the MKRSL algorithm, in this article we propose a novel KAF algorithm, named quantized MKRSL (QMKRSL), using the online VQ method. However, the QMKRSL using the traditional gradient descent method may not make full use of hidden information between the input and output spaces, because the difference between samples in the same quantization region could be ignored [41]. To further improve solution accuracy, we also propose an intensified QMKRSL using a bilateral gradient technique, named QMKRSL_BG, which can simultaneously adjust the coefficient of the closest center and the current desired output related to the same quantization region when a new input sample is discarded.
The rest of this article is organized as follows. Section 2 provides a brief review of KRSL and MKRSL. Then, the details of the proposed algorithms QMKRSL and QMKRSL_BG are described in Section 3. Finally, short-term chaotic time-series prediction experiments are conducted in Section 4, and the conclusion is summarized in Section 5.

The Kernel Risk-Sensitive Loss (KRSL) Algorithm
Correntropy measures the similarity of two random variables in the neighborhood of the joint space controlled by the kernel bandwidth [32]. This locality allows the correntropy to be robust to outliers. Based on the MCC, many robust learning algorithms have been developed [42,43]. Among them, the kernel bandwidth is a key parameter in the MCC. Generally, a smaller kernel bandwidth makes the algorithm more robust to outliers, but results in slow convergence speed and poor accuracy. On the other hand, when the kernel bandwidth becomes larger, the robustness will be significantly reduced when outliers appear. To achieve better performance, a new similarity measure, i.e., KRSL, was proposed in [37], which can be more "convex" and thus can achieve faster convergence rate and higher filtering accuracy while maintaining the robustness to outliers.
Given two random variables $X$ and $Y$ with joint distribution function $F_{XY}(x, y)$, the KRSL is defined as follows:

$$L_\lambda(X, Y) = \frac{1}{\lambda} E\left\{\exp\left(\lambda\left(1 - \kappa_\sigma(X - Y)\right)\right)\right\} = \frac{1}{\lambda} \int \exp\left(\lambda\left(1 - \kappa_\sigma(x - y)\right)\right) dF_{XY}(x, y), \tag{1}$$

where $\lambda > 0$ represents the risk-sensitive parameter, $\kappa_\sigma(\cdot)$ is a shift-invariant Mercer kernel, and $\sigma$ represents the kernel bandwidth that controls the range in which similarity is estimated. In addition, $E\{\cdot\}$ denotes the expectation operator. Typically, the kernel used in KRSL is the Gaussian kernel, given by:

$$\kappa_\sigma(x - y) = \exp\left(-\frac{e^2}{2\sigma^2}\right), \tag{2}$$

where $e = x - y$. In practice, since the joint distribution of $X$ and $Y$ is usually unknown, we only use a finite number of data samples $\{x(i), y(i)\}_{i=1}^{N}$ to approximate the expectation:

$$\hat{L}_\lambda(X, Y) = \frac{1}{N\lambda} \sum_{i=1}^{N} \exp\left(\lambda\left(1 - \kappa_\sigma(x(i) - y(i))\right)\right). \tag{3}$$

Then, the approximation of KRSL can also be considered as a distance between the vectors $\mathbf{X} = [x(1), \ldots, x(N)]^T$ and $\mathbf{Y} = [y(1), \ldots, y(N)]^T$.

In contrast to correntropy, the performance surface of the KRSL can be more "convex". The areas around the optimal solution are relatively flat to reduce the misadjustments, and the areas away from the optimal solution become steep to speed up the convergence. Moreover, the areas further away from the optimum gradually become entirely flat to avoid the big fluctuations caused by large outliers. Hence, KRSL can provide a more efficient solution that can achieve both a faster convergence rate and higher filtering accuracy while maintaining the robustness to outliers.

Remark 1. Actually, both C-Loss and KRSL are non-convex functions. However, compared with C-Loss, the performance surface of KRSL can make gradient-based search algorithms approach the optimal solution more effectively and accurately. Hence, in this article we call KRSL more "convex" than C-Loss.
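As a minimal illustration of the sample-based approximation described above, the following Python sketch averages exp(λ(1 − κσ(x(i) − y(i)))) over the samples and divides by λ, assuming an unnormalized Gaussian kernel. The function names `gaussian_kernel` and `empirical_krsl` are hypothetical and chosen only for this example.

```python
import math


def gaussian_kernel(e, sigma):
    """Unnormalized Gaussian kernel evaluated at the error e = x - y (assumed form)."""
    return math.exp(-(e * e) / (2.0 * sigma * sigma))


def empirical_krsl(x, y, lam, sigma):
    """Sample-mean estimate of the KRSL between two equal-length sequences.

    lam is the risk-sensitive parameter (lam > 0) and sigma the kernel bandwidth.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    total = sum(
        math.exp(lam * (1.0 - gaussian_kernel(a - b, sigma)))
        for a, b in zip(x, y)
    )
    return total / (lam * len(x))
```

Under these assumptions the estimate equals 1/λ when the two sequences coincide and saturates at exp(λ)/λ for arbitrarily large errors, which is one way to see why a single large outlier cannot dominate the loss.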

Minimum Kernel Risk-Sensitive Loss (MKRSL) Algorithm
Just like the MSE and MCC, the KRSL function can also be used as a cost function in adaptive filters. In so doing, the goal is to minimize the KRSL function between the desired signal $d(i)$ and the filter output $y(i)$. This optimization principle is called the MKRSL criterion, given by:

$$\min_{\omega} \frac{1}{N\lambda} \sum_{i=1}^{N} \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right), \tag{4}$$

where $u(i)$ represents the input vector, $e(i) = d(i) - y(i) = d(i) - \omega^T u(i)$ is the prediction error at time $i$, $N$ is the number of samples, $\kappa_{\sigma_1}(\cdot)$ is a Mercer kernel with kernel bandwidth $\sigma_1$ which is used to calculate the KRSL, and $\omega$ is the estimated value of the filter weight vector.
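In accordance with the stochastic gradient descent method, the instantaneous cost function of the MKRSL algorithm at time $i$ is defined as:

$$J(i) = \frac{1}{\lambda} \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right). \tag{5}$$

Then the process of updating the weight can be easily derived as:

$$\omega(i) = \omega(i-1) - \mu \frac{\partial J(i)}{\partial \omega(i-1)} = \omega(i-1) + \eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)\, u(i), \tag{6}$$

where $\omega(i)$ denotes the estimated weight vector at iteration $i$, $\mu$ is a learning update parameter, and $\eta = \mu / \sigma_1^2$ is the step-size parameter.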
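The stochastic gradient step described above can be sketched for a linear filter as follows. This is only an illustration under the assumption of a Gaussian kernel with bandwidth σ1 and an effective step size `eta`; the function name `mkrsl_linear_step` is hypothetical.

```python
import math


def mkrsl_linear_step(w, u, d, eta, lam, sigma1):
    """One stochastic-gradient MKRSL step for a linear filter (illustrative sketch).

    Returns the updated weight list and the prediction error used for the update.
    """
    # Prediction error with the current weights.
    e = d - sum(wi * ui for wi, ui in zip(w, u))
    # Gaussian kernel of the error (assumed unnormalized).
    k = math.exp(-(e * e) / (2.0 * sigma1 * sigma1))
    # Scalar gain: eta * exp(lam * (1 - k)) * k * e.
    gain = eta * math.exp(lam * (1.0 - k)) * k * e
    return [wi + gain * ui for wi, ui in zip(w, u)], e
```

Because the gain contains the factor k, which vanishes for very large errors, a sample with an extreme error leaves the weights essentially unchanged.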
The performance of linear adaptive filters will degrade dramatically when the mapping between input $u$ and desired output $d$ is nonlinear. Therefore, the input $u(i)$ is transformed into a high-dimensional feature space as $\phi(u(i))$ through the kernel-induced mapping (7):

$$\phi: \mathbb{U} \to \mathbb{F}, \tag{7}$$

where $\mathbb{U}$ is the input domain and the dimensionality of the feature space $\mathbb{F}$ is infinite when the Gaussian kernel is used.
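For simplicity, we denote $\phi(u(i)) = \phi(i)$. Then, we can obtain the weight update form for the MKRSL algorithm:

$$\omega(i) = \omega(i-1) + \eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)\, \phi(i). \tag{8}$$

The sequential learning rule for the MKRSL algorithm is shown as:

$$f_0 = 0, \qquad e(i) = d(i) - f_{i-1}(u(i)), \qquad f_i = f_{i-1} + \eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)\, \kappa_{\sigma_2}(u(i), \cdot), \tag{9}$$

where $f_i$ is the estimate of the nonlinear mapping between inputs and outputs at iteration $i$, and $\kappa_{\sigma_2}(\cdot)$ is another (Gaussian) kernel with the kernel bandwidth $\sigma_2$. Here, $\kappa_{\sigma_2}(\cdot)$ is used to calculate the inner product in the RKHS via the following kernel trick:

$$\phi(u)^T \phi(u') = \kappa_{\sigma_2}(u, u'). \tag{10}$$

Through iterations, the system output to a new input $u(i+1)$ can be solely expressed as:

$$f_i(u(i+1)) = \sum_{j=1}^{i} \alpha_j(i)\, \kappa_{\sigma_2}(u(j), u(i+1)), \tag{11}$$

where $\alpha_j(i)$ is the coefficient of center $u(j)$.
For any $\lambda$, the weight update will approach zero when the error tends to infinity, which means that the MKRSL algorithm will be robust to large outliers. The details of the MKRSL algorithm are summarized in Algorithm 1. Here, $\alpha(i)$ denotes the coefficient vector at iteration $i$, and $\alpha(i) = \{\alpha_j(i)\}_{j=1}^{i}$. Additionally, $C(i)$ denotes the center set at iteration $i$.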
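The sequential learning rule described above, in which every input becomes a new center with its own coefficient, can be sketched in Python as below. This is an illustrative reading of the text, not the authors' Algorithm 1; the names `gauss`, `krsl_gain`, and `mkrsl_train` are hypothetical, and Gaussian kernels are assumed for both σ1 and σ2.

```python
import math


def gauss(a, b, sigma):
    """Gaussian kernel between two input vectors a and b."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma * sigma))


def krsl_gain(e, eta, lam, sigma1):
    """Coefficient contributed by an error e: eta * exp(lam * (1 - k)) * k * e."""
    k = math.exp(-(e * e) / (2.0 * sigma1 * sigma1))
    return eta * math.exp(lam * (1.0 - k)) * k * e


def mkrsl_train(inputs, targets, eta, lam, sigma1, sigma2):
    """Unquantized kernel MKRSL: the network grows by one center per sample."""
    centers, alphas, errors = [], [], []
    for u, d in zip(inputs, targets):
        # Output of the current network for the new input.
        y = sum(a * gauss(c, u, sigma2) for a, c in zip(alphas, centers))
        e = d - y
        # Every input is stored as a new center with its own coefficient.
        centers.append(list(u))
        alphas.append(krsl_gain(e, eta, lam, sigma1))
        errors.append(e)
    return centers, alphas, errors
```

The returned center list has exactly as many entries as training samples, which is the linear growth that the quantized variant is meant to curb. Evaluating `krsl_gain` at a very large error gives a value of zero, matching the robustness argument in the text.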

The Quantized MKRSL (QMKRSL) Algorithm
As mentioned above, the MKRSL will generate an RBF network that grows linearly with the input. This growing structure results in a significant increase of computing costs and memory requirement, especially in the case of continuous adaptation [17]. The online VQ method has been successfully applied to KAFs for containing their linearly growing RBF network. Here, we incorporate the online VQ method into the MKRSL, thus proposing the QMKRSL algorithm. The QMKRSL algorithm is obtained by quantizing the feature vector $\phi(i)$ in the weight update Equation (8), which can be shown as:

$$\omega(i) = \omega(i-1) + \eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)\, Q(\phi(i)), \tag{12}$$

where $Q(\cdot)$ represents the quantization operator in $\mathbb{F}$. However, considering that the dimensionality of the feature space is very high, we usually perform the quantization operation in the input space $\mathbb{U}$. Therefore, the learning rule for the QMKRSL algorithm can be derived as:

$$f_0 = 0, \qquad e(i) = d(i) - f_{i-1}(u(i)), \qquad f_i = f_{i-1} + \eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)\, \kappa_{\sigma_2}(Q(u(i)), \cdot), \tag{13}$$

where $Q(\cdot)$ is a quantization operator in $\mathbb{U}$. Now, the QMKRSL algorithm has been obtained, and the details are described in Algorithm 2. Here, $\gamma \geq 0$ is the quantization size, and $C(i)$ denotes the center set at iteration $i$. Specifically, in the case that the current sample $u(i)$ needs to be quantized, the network topology of QMKRSL is shown in Figure 1. Here, $C_{q^*}(i-1)$ is the closest center of $u(i)$, $\eta \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i)$ is the update of the coefficient vector generated by the redundant data $u(i)$, and $l = \mathrm{size}(C(i-1))$.

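Algorithm 2 The quantized MKRSL (QMKRSL) algorithm (outline).
Initialization: choose parameters η, λ, and σ 1 , …
Computation, for each new sample {u(i), d(i)}:
compute the distance between u(i) and the center set C(i − 1);
if this distance does not exceed γ, keep the center set unchanged and update the coefficient of the closest center C q * (i − 1), where q * = arg min q ‖u(i) − C q (i − 1)‖;
else, add u(i) to the center set as a new center.

As we can see, the quantization approach merely uses the Euclidean distance to determine whether the new input should be added to the dictionary, and updates the coefficient vector of the closest center with the redundant data to improve accuracy. This method is easy to compute and takes full advantage of redundant data rather than discarding them directly, making it possible to avoid the limitations of traditional sparsification methods.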
However, it should be noted that the coefficient update process of QMKRSL ignores the difference between the new input sample and its closest center.In addition, the difference between the corresponding desired outputs of the samples in the same quantization region is also neglected.Hence, there is still room for improvement of the QMKRSL algorithm, and the intensified algorithm will be presented in the next subsection.
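The quantization logic discussed in this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions (Gaussian kernels, Euclidean distance, merge when the distance to the closest center is at most γ), not a transcription of the authors' listing; `qmkrsl_train` and its helpers are hypothetical names.

```python
import math


def gauss(a, b, sigma):
    """Gaussian kernel between two input vectors a and b."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma * sigma))


def krsl_gain(e, eta, lam, sigma1):
    """Coefficient contributed by an error e: eta * exp(lam * (1 - k)) * k * e."""
    k = math.exp(-(e * e) / (2.0 * sigma1 * sigma1))
    return eta * math.exp(lam * (1.0 - k)) * k * e


def qmkrsl_train(inputs, targets, eta, lam, sigma1, sigma2, gamma):
    """Quantized MKRSL sketch: inputs close to an existing center are merged into it."""
    centers, alphas = [], []
    for u, d in zip(inputs, targets):
        y = sum(a * gauss(c, u, sigma2) for a, c in zip(alphas, centers))
        gain = krsl_gain(d - y, eta, lam, sigma1)
        if centers:
            # Euclidean distance from u to every center; q is the closest one.
            dists = [math.dist(u, c) for c in centers]
            q = min(range(len(centers)), key=dists.__getitem__)
            if dists[q] <= gamma:
                # Redundant input: update only the closest center's coefficient.
                alphas[q] += gain
                continue
        # Novel input: store it as a new center.
        centers.append(list(u))
        alphas.append(gain)
    return centers, alphas
```

With γ = 0 and distinct inputs, every sample becomes a center and the sketch reduces to the unquantized filter; larger γ trades network size against accuracy.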

The QMKRSL Using Bilateral Gradient Technique (QMKRSL_BG)
In the iterative process, QMKRSL only considers the change of the input space, but does not take into account the difference between the samples in the same quantization region, which may make it unable to make full use of the information hidden in the input and output spaces. Therefore, an efficient way to update the desired output is needed. To further improve the performance, we propose an intensified QMKRSL using a bilateral gradient technique, named QMKRSL_BG.
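Using the stochastic gradient descent method to minimize the instantaneous cost function (5) in the output space, the current desired output is updated by:

$$d_U(i) = d(i) - \eta_d \nabla_{d(i)} J(i) = d(i) - \frac{\eta_d}{\sigma_1^2} \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e(i))\right)\right) \kappa_{\sigma_1}(e(i))\, e(i), \tag{14}$$

where $d_U(i)$ denotes the updated desired output, $\nabla_{d(i)} J(i)$ represents the gradient of the cost function in the desired output direction, $\eta_d$ is a learning update parameter, and $\eta_d / \sigma_1^2$ is the step-size parameter used to update the current desired output.
Then, the prediction error and cost function could be updated by:

$$e_U(i) = d_U(i) - y(i), \tag{15}$$

$$J_U(i) = \frac{1}{\lambda} \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e_U(i))\right)\right). \tag{16}$$

Similarly, using the stochastic gradient descent method to minimize $J_U(i)$ in the input space, the coefficient update rule of the closest center is derived as:

$$\alpha_{q^*}(i) = \alpha_{q^*}(i-1) - \eta_\alpha \nabla_{\alpha_{q^*}(i-1)} J_U(i) = \alpha_{q^*}(i-1) + \frac{\eta_\alpha}{\sigma_1^2} \exp\left(\lambda\left(1 - \kappa_{\sigma_1}(e_U(i))\right)\right) \kappa_{\sigma_1}(e_U(i))\, e_U(i)\, S_{C_{q^*}(i-1), u(i)}, \tag{17}$$

where $\nabla_{\alpha_{q^*}(i-1)} J_U(i)$ is the gradient of the cost function with respect to the coefficient vector, $\eta_\alpha$ is a learning update parameter, and $\eta_\alpha / \sigma_1^2$ is the step-size parameter used to update the coefficient of the closest center. In addition, $S_{C_{q^*}(i-1), u(i)} = \kappa_{\sigma_2}(C_{q^*}(i-1), u(i))$ reflects the difference between the input $u(i)$ and its closest center $C_{q^*}(i-1)$, which can be seen as a weight parameter.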
Considering the difference between samples in the same quantization region, the update rules shown in (14)–(17) are used when the new sample needs to be quantized. For other cases, the update rule for QMKRSL_BG is the same as that of QMKRSL. A specific description of the proposed QMKRSL_BG is summarized in Algorithm 3.
Here, it can easily be found that, through the use of the gradient descent method in the output space, QMKRSL_BG employs the prediction error caused by the current desired output to update the coefficient vector. Thus, the QMKRSL_BG considers the difference between the desired outputs of samples which are very close in the input space. Moreover, an additional term $S_{C_{q^*}(i-1), u(i)}$ is introduced in the coefficient update using the gradient descent method in the input space, which reflects the difference between the current input $u(i)$ and its closest center $C_{q^*}(i-1)$. Consequently, the QMKRSL_BG can obtain more information from the input and output spaces, so as to achieve better filtering accuracy.

Algorithm 3 Quantized MKRSL using the bilateral gradient technique (QMKRSL_BG) algorithm.
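The bilateral update applied when a sample falls into an existing quantization region can be sketched as two consecutive gradient steps: one on the desired output, then one on the coefficient of the closest center, weighted by the kernel similarity between the input and that center. The sketch below is an interpretation of the description above, assuming Gaussian kernels; the names `bg_quantized_update`, `step_d`, and `step_alpha` (the effective step sizes) are hypothetical.

```python
import math


def gauss(a, b, sigma):
    """Gaussian kernel between two input vectors a and b."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma * sigma))


def krsl_factor(e, lam, sigma1):
    """Error-dependent factor exp(lam * (1 - k)) * k * e with k the Gaussian kernel of e."""
    k = math.exp(-(e * e) / (2.0 * sigma1 * sigma1))
    return math.exp(lam * (1.0 - k)) * k * e


def bg_quantized_update(alpha_q, center_q, u, d, y,
                        step_d, step_alpha, lam, sigma1, sigma2):
    """Bilateral-gradient update for a quantized sample (illustrative sketch).

    alpha_q  : current coefficient of the closest center
    center_q : the closest center itself
    y        : network output for u before the update
    """
    e = d - y
    # Step 1 (output space): move the desired output along the negative gradient.
    d_u = d - step_d * krsl_factor(e, lam, sigma1)
    e_u = d_u - y
    # Step 2 (input space): update the closest center's coefficient, weighted
    # by the similarity between the input and that center.
    s = gauss(center_q, u, sigma2)
    alpha_new = alpha_q + step_alpha * krsl_factor(e_u, lam, sigma1) * s
    return alpha_new, d_u, e_u
```

Setting `step_d` to zero and feeding an input that coincides with its center recovers the plain quantized coefficient update, which makes the role of the two extra ingredients (the adjusted desired output and the similarity weight) easy to isolate.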

Complexity Analysis
Here, we analyze the time complexity of the two proposed algorithms. We can see from the update process that MKRSL shares the computational simplicity of MCC. As shown in (11), the time complexity of MKRSL is O(N), where N represents the number of input samples. Like other quantized methods, e.g., quantized kernel least mean square (QKLMS) and quantized kernel maximum correntropy (QKMC), the key parts of QMKRSL are the calculation for online VQ and updating the coefficient vector α(i). For each input sample, the Euclidean distance to the dictionary (center set) must be calculated, which means that the complexity of the online VQ method is linearly related to the size of the dictionary. According to the above analysis of the output expression of QMKRSL, the time complexity of those two parts of the QMKRSL is O(L), where L represents the network size. Hence, the overall computational complexity of QMKRSL is equal to O(L). Compared with QMKRSL, the proposed QMKRSL_BG only increases the computational effort slightly due to updating the desired output, and this does not affect its practical application.

Simulation Results and Discussion
In this section, we conduct simulations related to short-term chaotic time-series prediction, with the purpose of validating the performance of the proposed algorithms QMKRSL and QMKRSL_BG.

Dataset and Metric
The Lorenz chaotic system is a nonlinear dynamical system with chaotic flow, known for its butterfly shape, which is generated from a system of three coupled differential equations in the state variables x, y, and z [44]. Here the parameters are set as β = 8/3, δ = 10, and ρ = 28. Using this setting, the time-series data are obtained with sampling period 0.01.
We use the first component, i.e., x, to perform the prediction task. The first 1000 samples of the processed Lorenz time-series are shown in Figure 2. In our simulations, we utilize the previous five samples u(i) = [x(i − 5), x(i − 4), x(i − 3), x(i − 2), x(i − 1)] T to predict the current point x(i), where u(i) and x(i) represent the input vector and the corresponding desired output, respectively. Here, we select 4005 continuous samples to generate the training input set (5 × 4000) and the corresponding desired output set (4000 × 1), and use the following 1005 samples to generate the testing input set (5 × 1000) and the corresponding desired output set (1000 × 1).
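The construction of input vectors and desired outputs from the scalar series can be sketched as a sliding window of five past samples. The helper name `make_embedding` is hypothetical, and the integer series in the usage lines merely stands in for the processed Lorenz data.

```python
def make_embedding(series, order=5):
    """Build (input, desired output) pairs from a scalar time series.

    Each input holds the `order` samples preceding the point to be predicted.
    """
    inputs = [list(series[i - order:i]) for i in range(order, len(series))]
    targets = [series[i] for i in range(order, len(series))]
    return inputs, targets


# Placeholder series: 4005 consecutive samples yield 4000 training pairs.
train_inputs, train_targets = make_embedding(list(range(4005)))
```

A segment of 4005 samples therefore produces 4000 input vectors of length five, and a following segment of 1005 samples produces the 1000 test pairs in the same way.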
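In our simulations, the mean square error (MSE) is utilized to evaluate the performance of our proposed algorithms. It is defined as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(d(i) - y(i)\right)^2,$$

where $N$ represents the number of predicted values, $d(i)$ is the desired output, and $y(i)$ is the corresponding prediction.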
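For reference, the evaluation metric amounts to averaging the squared differences between desired and predicted values. A short sketch follows, with the hypothetical function name `mean_square_error`.

```python
def mean_square_error(desired, predicted):
    """Average squared difference between desired outputs and predictions."""
    if len(desired) != len(predicted) or len(desired) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((d - y) ** 2 for d, y in zip(desired, predicted)) / len(desired)
```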
In addition to the original MKRSL, the filtering performance is also compared with other quantized methods, including QKLMS and QKMC, under the same simulation conditions to demonstrate the superiority of our algorithms. To verify the performance of these algorithms in different noise environments, various types of noise are adopted in the simulations. They include zero-mean Gaussian noise with variance 0.04, Uniform noise distributed over [−0.

Simulation Results
The quantization size γ is a key parameter in quantized KAF algorithms. Hence, we first analyze how the performance is influenced by the quantization size. Generally speaking, the larger the quantization size, the more input vectors are quantized to their nearest center, which means that the network size decreases dramatically once the quantization size grows beyond a certain degree. For QMKRSL, the effect of γ on the final network size under different noise environments can be seen in Figure 3. Meanwhile, the final network sizes of QKLMS, QKMC, and QMKRSL_BG are equal to that of QMKRSL, because they use the same online VQ method for sparsification, and the quantization parameter has no effect on MKRSL.
Furthermore, the parameters for all the algorithms under different noise environments are selected in accordance with Table 1. Using these settings, we also show the final testing MSE of these five algorithms under four types of noise environments with different quantization sizes in Figure 4. Here, the final testing MSE is determined by the average value of the last 200 iterations in the learning curves, and for each iteration, the testing MSE is calculated using 1000 test data. It can be found from Figure 4 that the efficient performance surface makes QMKRSL outperform QKLMS and QKMC in terms of testing MSE. In addition, the bilateral gradient descent method enables QMKRSL_BG to further improve filtering accuracy for all quantization sizes. Due to the fast decrease of network size, the testing MSE will increase significantly when the quantization size increases beyond a certain degree. Hence, the QMKRSL and QMKRSL_BG are superior to MKRSL when the quantization size is bounded within a certain range. Therefore, we can conclude that the algorithms QMKRSL and QMKRSL_BG can effectively address time-series prediction while constraining the network structure.
Then, we compare the learning performance of QMKRSL and QMKRSL_BG with that of the original MKRSL, QKLMS, and QKMC, in an effort to demonstrate the superiority of the KRSL function and the online VQ method. In the simulations below, considering the prediction accuracy and the computational cost simultaneously, we select γ as 0.2 to reduce the network size to a modest range, on the basis of the analysis of the results in Figures 3 and 4. Then, the final network sizes for the Gaussian, uniform, Bernoulli, and sine wave noise cases are reduced to 2455, 2048, 1373, and 1555, respectively. Therefore, only 2455, 2048, 1373, and 1555 items of training data are chosen as network centers from 4000 input samples for the final filter network.
Figure 5 shows the learning curves of these algorithms under different noise environments. It is clear that the testing MSE of the QMKRSL is smaller than that of the QKLMS and the QKMC in all four types of noise environments, and the QMKRSL_BG performs better than the other four algorithms. This indicates that the efficient performance surface of the KRSL function can make the QMKRSL outperform the QKMC and QKLMS; the online VQ method using a bilateral gradient technique can obtain more hidden information in the system, and thus the QMKRSL_BG can achieve a higher filtering accuracy than the QMKRSL and other algorithms. Let ε be the required prediction accuracy. If the prediction error is less than the threshold ε, the error is within the acceptable range and thus the prediction is successful. Otherwise, the prediction fails. With the same noise model used in Figure 5d, the successful prediction rates of these algorithms with a varying threshold ε are shown in Figure 6. It can be seen that the larger the threshold ε, the higher the success rate. Moreover, the successful prediction rate of QMKRSL is similar to that of the MKRSL, and the successful prediction rate of the QMKRSL_BG is higher than that of the other algorithms. This result also verifies that the QMKRSL can maintain the filtering accuracy while effectively limiting the network size, and the QMKRSL_BG can further improve the filtering accuracy. Hence, in consideration of the final network size and prediction effect simultaneously, the QMKRSL and QMKRSL_BG proposed in this article may be two competitive choices to constrain the network size as well as achieve better prediction performance.
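The successful prediction rate described above can be computed as the fraction of test points whose error stays below the threshold ε. The sketch below assumes that "prediction error" means the absolute difference between desired and predicted values; the function name `success_rate` is hypothetical.

```python
def success_rate(desired, predicted, eps):
    """Fraction of predictions whose absolute error is below the threshold eps."""
    if len(desired) != len(predicted) or len(desired) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    hits = sum(1 for d, y in zip(desired, predicted) if abs(d - y) < eps)
    return hits / len(desired)
```

Since enlarging ε can only add successful points, the rate is non-decreasing in ε, consistent with the trend reported for Figure 6.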

Conclusions
On the basis of the KRSL criterion, an effective nonlinear KAF algorithm called the MKRSL has been derived to achieve better filtering performance in non-Gaussian noise environments. To constrain the linearly growing RBF network structure of MKRSL, the quantization technique is incorporated into the original MKRSL, and thus the QMKRSL is proposed in this article. The simulation results on chaotic time-series show that QMKRSL can effectively reduce network size while maintaining simplicity and prediction accuracy. Considering the difference between the samples in the same quantization region, an intensified QMKRSL using the bilateral gradient technique is accordingly developed to further improve the accuracy, and it is named the QMKRSL_BG. In contrast to the traditional gradient descent method, the bilateral gradient descent approach can simultaneously update the coefficient vector of centers and the desired output of samples in the same quantization region. This property allows the QMKRSL_BG to get more hidden information from the input and output spaces, and thus achieve higher accuracy. Simulation results have verified the theoretical expectations and demonstrated that the QMKRSL_BG can achieve better filtering performance than some other quantized algorithms. In practical scenarios, computing devices usually have limited computational capacity and physical memory, and are often disturbed by non-Gaussian noise. Therefore, how to effectively constrain the network size as well as achieve higher accuracy in non-Gaussian noise environments is a problem to be solved. In consideration of the facts mentioned above, the proposed algorithms may be good choices for nonlinear adaptive learning tasks in realistic scenarios.
Due to the smoothness and strict positive-definiteness of the Gaussian function, the kernel used in this article defaults to the Gaussian kernel. However, the Gaussian kernel is not always the best choice. Therefore, how to select a better kernel function appropriately is an interesting subject for future study. In addition, as a key parameter in quantized KAF algorithms, the quantization size is determined as a tradeoff between prediction accuracy and computational complexity (network size). Thus, how to select the quantization size appropriately and adaptively is also an important issue to be solved in the future. Moreover, how to achieve good performance when the final network size is fixed in advance is another interesting and important future research topic along this direction.

Figure 1. Network topology for quantization operation in the i-th iteration computation of the quantized minimum kernel risk-sensitive loss (QMKRSL). VQ: vector quantization.

Figure 2. A portion of the processed Lorenz time-series.

Figure 3. The effect of the quantization size γ on final network size under different noise environments using QMKRSL.