Kernel Risk-Sensitive Mean p-Power Error Algorithms for Robust Learning

As a nonlinear similarity measure defined in the reproducing kernel Hilbert space (RKHS), the correntropic loss (C-Loss) has been widely applied in robust learning and signal processing, but its highly non-convex nature can degrade performance. To alleviate this issue, the convex kernel risk-sensitive loss (KRL) was proposed to measure similarity in the RKHS; it is a risk-sensitive loss defined as the expectation of an exponential function of the squared estimation error. In this paper, a novel nonlinear similarity measure, namely the kernel risk-sensitive mean p-power error (KRP), is proposed by incorporating the mean p-power error into the KRL, yielding a generalization of the KRL measure. The KRP reduces to the KRL when p = 2 and can outperform the KRL when an appropriate p is chosen for robust learning. Several properties of the KRP are presented and discussed. To improve the robustness of the kernel recursive least squares (KRLS) algorithm and reduce its network size, two robust recursive kernel adaptive filters, namely the recursive minimum kernel risk-sensitive mean p-power error (RMKRP) algorithm and its quantized version (QRMKRP), are proposed in the RKHS under the minimum kernel risk-sensitive mean p-power error (MKRP) criterion. Monte Carlo simulations confirm the superiority of the proposed RMKRP and its quantized version.


Introduction
Online kernel-based learning extends kernel methods to online settings in which data arrive sequentially; it has been widely applied in signal processing thanks to its excellent performance on nonlinear problems [1]. The development of kernel methods is of great significance for practical applications. In kernel methods, the input data are transformed from the original space into the reproducing kernel Hilbert space (RKHS) using the kernel trick [2]. As representative kernel methods, kernel adaptive filters (KAFs) provide an effective way to transform a nonlinear problem into a linear one and have been widely used in system identification and time-series prediction [3][4][5]. Generally, KAFs are designed, from the perspective of the cost function, for either Gaussian or non-Gaussian noise.
For Gaussian noise, second-order similarity measures of the error are generally used as the cost function of KAFs to achieve the desired filtering accuracy. In the Gaussian noise environment, KAFs based on second-order similarity measures therefore fall mainly into three categories: the kernel least mean square (KLMS) algorithm [6], the kernel affine projection algorithm, and the kernel recursive least squares (KRLS) algorithm [8].

Definition
According to [17], the risk-sensitive loss can be defined in the RKHS, yielding the kernel risk-sensitive loss (KRL). Given two arbitrary scalar random variables X and Y, where X, Y ∈ R, the KRL is defined by

$$L_{\lambda}(X, Y) = \frac{1}{\lambda}\,\mathrm{E}\!\left[\exp\!\big(\lambda\,(1 - \kappa_{\sigma}(X - Y))\big)\right] = \frac{1}{\lambda}\int \exp\!\big(\lambda\,(1 - \kappa_{\sigma}(x - y))\big)\, dF_{XY}(x, y), \tag{1}$$

where λ > 0 is a risk-sensitive parameter; ϕ(X) = κ_σ(X, ·) is the nonlinear mapping induced by a Mercer kernel κ_σ(·), which transforms the data from the original space into the RKHS F equipped with an inner product ⟨·, ·⟩_F satisfying ⟨ϕ(X), ϕ(Y)⟩_F = ϕ^T(X)ϕ(Y) = κ_σ(X − Y); E denotes the mathematical expectation; ‖·‖_F denotes the norm in the RKHS F; and F_XY(x, y) denotes the joint distribution function of (X, Y). The shift-invariant Gaussian kernel κ_σ(·) with bandwidth σ is given by

$$\kappa_{\sigma}(X - Y) = \exp\!\left(-\frac{(X - Y)^2}{2\sigma^2}\right). \tag{2}$$

However, the joint distribution of (X, Y) is usually unknown, and only N samples {x(i), y(i)}_{i=1}^{N} are available. Hence, a nonparametric estimate of L_λ(X, Y) is obtained by applying the Parzen window method [19]:

$$\hat{L}_{\lambda}(X, Y) = \frac{1}{N\lambda}\sum_{i=1}^{N} \exp\!\big(\lambda\,(1 - \kappa_{\sigma}(x(i) - y(i)))\big). \tag{3}$$

Note that, by the kernel trick and (2), the inner product in the RKHS of an input with itself is ϕ^T(X)ϕ(X) = exp(−(X − X)²/(2σ²)) = 1.
In this paper, we define a new non-second-order similarity measure in the RKHS, namely the kernel risk-sensitive mean p-power error (KRP) loss. Given two random variables X and Y, the KRP loss is defined by

$$L_{\lambda, p}(X, Y) = \frac{1}{\lambda}\,\mathrm{E}\!\left[\exp\!\big(\lambda\,(1 - \kappa_{\sigma}(X - Y))^{p/2}\big)\right], \tag{4}$$

where p > 0 is the power parameter. Note that the KRL can be regarded as a special case of the KRP with p = 2.
However, the joint distribution of X and Y is usually unknown in practice. Hence, the empirical KRP is defined as

$$\hat{L}_{\lambda, p}(X, Y) = \frac{1}{N\lambda}\sum_{i=1}^{N} \exp\!\big(\lambda\,(1 - \kappa_{\sigma}(x(i) - y(i)))^{p/2}\big), \tag{5}$$

where {x(i), y(i)}_{i=1}^{N} denotes the available finite set of samples. The empirical KRP can be regarded as a distance between X = [x(1), x(2), ..., x(N)]^T and Y = [y(1), y(2), ..., y(N)]^T.
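To make the definition concrete, the following Python sketch evaluates the empirical KRP of (5); the function names and test data are our own, and setting p = 2 recovers the empirical KRL of (3).

```python
import numpy as np

def gaussian_kernel(e, sigma):
    # Shift-invariant Gaussian kernel of (2): kappa_sigma(e) = exp(-e^2 / (2 sigma^2))
    return np.exp(-np.asarray(e) ** 2 / (2.0 * sigma ** 2))

def empirical_krp(x, y, lam=1.0, p=2.0, sigma=1.0):
    # Empirical KRP of (5): (1 / (N lam)) * sum_i exp(lam * (1 - kappa(x_i - y_i))^(p/2))
    z = 1.0 - gaussian_kernel(np.asarray(x) - np.asarray(y), sigma)
    return float(np.mean(np.exp(lam * z ** (p / 2.0))) / lam)

# Sanity check on synthetic data: p = 2 recovers the empirical KRL of (3).
rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
krl = float(np.mean(np.exp(1.0 * (1.0 - gaussian_kernel(x - y, 1.0)))))
assert np.isclose(empirical_krp(x, y, lam=1.0, p=2.0, sigma=1.0), krl)
```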

Properties
In the following, we give some important properties of the proposed KRP.
Therefore, we can obtain the following limiting behavior for large σ.

Proof. Since exp(x) can be approximated by 1 + x for a small enough x, and since, for a large enough σ,

$$1 - \kappa_{\sigma}(X - Y) \approx \frac{(X - Y)^2}{2\sigma^2}, \tag{7}$$

we can obtain the approximation

$$L_{\lambda, p}(X, Y) \approx \frac{1}{\lambda}\,\mathrm{E}\!\left[\exp\!\left(\lambda\left(\frac{(X - Y)^2}{2\sigma^2}\right)^{p/2}\right)\right].$$

Similarly, since λ((X − Y)²/(2σ²))^{p/2} → 0 for a large enough σ, we can also obtain the approximation

$$L_{\lambda, p}(X, Y) \approx \frac{1}{\lambda} + \frac{1}{(2\sigma^2)^{p/2}}\,\mathrm{E}\!\left[|X - Y|^{p}\right]. \tag{8}$$

According to (7) and (8), the KRP behaves like a shifted and scaled MPE when σ is large enough.

Remark 1. According to Properties 3 and 4, the KRP is approximately equivalent to the KMPE [27] when λ is small enough, and equivalent to the MPE [15] when σ is large enough. Thus, the KMPE and MPE can be viewed as two extreme cases of the KRP.
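The two limits in Remark 1 can be checked numerically. The sketch below (our own illustration with arbitrary synthetic data) compares the empirical KRP against its shifted/scaled MPE approximation for a large σ, and against the shifted KMPE-like term for a small λ.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=5000) - rng.normal(size=5000)  # errors e = x - y
p = 4.0

# Large-sigma limit: KRP ~ 1/lam + (2 sigma^2)^(-p/2) * E|e|^p (shifted/scaled MPE)
lam, sigma = 1.0, 50.0
z = 1.0 - np.exp(-e ** 2 / (2.0 * sigma ** 2))
krp = np.mean(np.exp(lam * z ** (p / 2.0))) / lam
mpe_like = 1.0 / lam + np.mean(np.abs(e) ** p) / (2.0 * sigma ** 2) ** (p / 2.0)
print(krp, mpe_like)   # nearly identical when sigma is large

# Small-lambda limit: KRP ~ 1/lam + E[z^(p/2)] (shifted KMPE up to a 2^(p/2) scale)
lam, sigma = 1e-4, 1.0
z = 1.0 - np.exp(-e ** 2 / (2.0 * sigma ** 2))
krp = np.mean(np.exp(lam * z ** (p / 2.0))) / lam
kmpe_like = 1.0 / lam + np.mean(z ** (p / 2.0))
print(krp, kmpe_like)  # nearly identical when lambda is small
```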
Remark 3. According to Property 7, the empirical KRP L̂_{λ,p}(X, 0), where 0 denotes an N-dimensional zero vector, behaves like an L_p norm of X when the kernel bandwidth σ is large enough.

Application to Adaptive Filtering
In this section, to combat non-Gaussian noise, two robust recursive adaptive algorithms under the proposed KRP criterion are developed in the RKHS, using the kernel trick and the vector quantization technique, respectively.

RMKRP
We first introduce the recursive strategy into the KRP loss function, yielding the recursive minimum kernel risk-sensitive mean p-power error (RMKRP) algorithm. The offline solution minimizing the KRP loss is obtained first. Based on this offline solution, the recursive (online) solution is then derived using matrix manipulations, which generates the RMKRP algorithm. The details are as follows.
Consider the prediction of a continuous input-output mapping f : U → R based on the adaptive filtering structure shown in Figure 1, where a sequence of input-output samples {u(j), d(j)}_{j=1}^{i} is used to perform the prediction of f(·). The nonlinear mapping ϕ(u(j)) of the input u(j) is denoted by ϕ(j) for simplicity. Hence, in the RKHS F, the training samples become {Φ(i), d(i)}, where the desired output vector is d(i) = [d(1), d(2), ..., d(i)]^T and the input kernel mapping matrix is Φ(i) = [ϕ(1), ϕ(2), ..., ϕ(i)]. The prediction, denoted by f̂(·), in the RKHS is therefore given as f̂(·) = ϕ^T(·)Ω, where Ω ∈ F is the weight vector in the high-dimensional feature space F. An exponentially weighted loss function is used here to put more emphasis on recent data and to de-emphasize data from the remote past [28]. When {Φ(i), d(i)} are available, the weight vector Ω(i) is obtained as the offline solution minimizing the following weighted cost function:

$$J(\Omega(i)) = \sum_{j=1}^{i} \rho^{i-j}\,\frac{1}{\lambda}\exp\!\big(\lambda z^{p/2}(j)\big) + \frac{1}{2}\rho^{i}\zeta\,\|\Omega(i)\|_{F}^{2}, \tag{15}$$

where ρ ∈ [0, 1] denotes the forgetting factor, ζ is the regularization factor, z(j) = 1 − exp(−e²(j)/(2σ²)), and e(j) = d(j) − ϕ^T(j)Ω(i) denotes the jth estimation error. The second term is a norm penalty, which guarantees the existence of the inverse of the input-data autocorrelation matrix, especially during the initial update stages. In addition, the regularization term is weighted by ρ^i, which de-emphasizes regularization as time progresses. According to Property 6, the empirical KRP as a function of e is convex at any point satisfying max_{j=1,2,...,i} |e(j)| ≤ σ, λ > 0, and p ≥ 2. To obtain the minimum of (15), its gradient is calculated as

$$\frac{\partial J(\Omega(i))}{\partial \Omega(i)} = -\frac{p}{2\sigma^{2}}\sum_{j=1}^{i} \rho^{i-j}\,\nu(j)\,e(j)\,\phi(j) + \rho^{i}\zeta\,\Omega(i). \tag{16}$$

Setting (16) to zero, i.e., ∂J(Ω)/∂Ω = 0, we can obtain the offline solution minimizing (15) as

$$\Omega(i) = \left[\Phi(i)\Theta(i)\Phi^{T}(i) + \frac{2\sigma^{2}}{p}\rho^{i}\zeta I\right]^{-1}\Phi(i)\Theta(i)\,d(i), \tag{17}$$

where ν(j) = exp(λ z^{p/2}(j)) z^{p/2−1}(j)(1 − z(j)), j = 1, 2, ..., i, Θ(i) = diag(ρ^{i−1}ν(1), ..., ρν(i−1), ν(i)), and I denotes an identity matrix with an appropriate dimension.
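Since ν(j) depends on the errors e(j), which in turn depend on Ω(i), the offline solution (17) can be computed in batch form, in the coefficient (dual) domain derived below, by a fixed-point iteration. The following numpy sketch is our own illustration under the reconstructed formulas above; the least-squares initialization and the numerical floor 1e-12 are arbitrary choices.

```python
import numpy as np

def gaussian_gram(U, sigma1):
    # Gram matrix K[m, n] = exp(-||u_m - u_n||^2 / (2 sigma1^2))
    sq = np.sum(U ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * U @ U.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma1 ** 2))

def offline_mkrp(U, d, lam=1.0, p=4.0, sigma=1.0, sigma1=1.0,
                 rho=1.0, zeta=0.1, n_iter=10):
    # Fixed-point iteration on alpha = [K + (2 sigma^2 / p) rho^i zeta Theta^{-1}]^{-1} d,
    # with Theta = diag(rho^{i-j} nu(j)) and nu(j) recomputed from the current errors.
    N = len(d)
    K = gaussian_gram(U, sigma1)
    w = rho ** np.arange(N - 1, -1, -1)                 # rho^{i-j}
    alpha = np.linalg.solve(K + zeta * np.eye(N), d)    # plain LS initialization
    for _ in range(n_iter):
        e = d - K @ alpha
        z = 1.0 - np.exp(-e ** 2 / (2.0 * sigma ** 2))
        nu = np.exp(lam * z ** (p / 2.0)) * z ** (p / 2.0 - 1.0) * (1.0 - z)
        theta = np.maximum(w * nu, 1e-12)               # floor avoids division by zero
        R = K + (2.0 * sigma ** 2 / p) * (rho ** N) * zeta * np.diag(1.0 / theta)
        alpha = np.linalg.solve(R, d)
    return alpha
```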
To obtain an efficient recursive solution to the minimum of (15), a Mercer kernel is used to construct the RKHS. Here, the Gaussian kernel, denoted κ_{σ1}(·) with kernel width σ1, is used. The inner product in the RKHS can then be calculated efficiently by the kernel trick [28], i.e., κ_{σ1}(u(i), u(j)) = κ_{σ1}(u(i) − u(j)) = ϕ^T(u(i))ϕ(u(j)) = ϕ^T(i)ϕ(j), which avoids the direct calculation of the nonlinear mapping ϕ(·).
The matrix inversion lemma [28],

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B\left(C^{-1} + DA^{-1}B\right)^{-1}DA^{-1}, \tag{18}$$

applied to (17) yields

$$\Omega(i) = \Phi(i)\left[\Phi^{T}(i)\Phi(i) + \frac{2\sigma^{2}}{p}\rho^{i}\zeta\,\Theta^{-1}(i)\right]^{-1} d(i). \tag{19}$$

Note that Φ^T(i)Φ(i) in (19) can be computed efficiently by the kernel trick. The weight vector Ω(i) is therefore described explicitly as a linear combination of the input data in the RKHS, i.e.,

$$\Omega(i) = \Phi(i)\,\alpha(i), \tag{20}$$

where α(i) denotes the coefficient vector. It can be seen from (20) that the recursive form of Ω(i) reduces to that of α(i). Hence, in the following, the key to finding a recursive solution to the minimum of (15) is to obtain the recursive form of α(i).
Comparing (19) with (20), the coefficient vector α(i) is calculated using the kernel trick as

$$\alpha(i) = \left[\Phi^{T}(i)\Phi(i) + \frac{2\sigma^{2}}{p}\rho^{i}\zeta\,\Theta^{-1}(i)\right]^{-1} d(i). \tag{21}$$

For simplicity, we obtain the update form of α(i) indirectly by defining

$$\Lambda(i) = \left[\Phi^{T}(i)\Phi(i) + \frac{2\sigma^{2}}{p}\rho^{i}\zeta\,\Theta^{-1}(i)\right]^{-1}, \tag{22}$$

so that α(i) = Λ(i)d(i). Partitioning the data at time i into the past i − 1 samples and the new pair (u(i), d(i)), and noting that the first i − 1 diagonal entries of Θ(i) equal ρ times those of Θ(i − 1) when the weights ν(j) are kept fixed, some matrix operations simplify Λ^{−1}(i) in (23) to the bordered form

$$\Lambda^{-1}(i) = \begin{bmatrix} \Lambda^{-1}(i-1) & \xi(i) \\ \xi^{T}(i) & \phi^{T}(i)\phi(i) + \dfrac{2\sigma^{2}}{p}\,\dfrac{\rho^{i}\zeta}{\nu(i)} \end{bmatrix}, \tag{24}$$

where ξ(i) = Φ^T(i − 1)ϕ(i). By using the following block matrix inversion identity [18,21,28],

$$\begin{bmatrix} A & C \\ C^{T} & B \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}C S^{-1} C^{T} A^{-1} & -A^{-1}C S^{-1} \\ -S^{-1} C^{T} A^{-1} & S^{-1} \end{bmatrix}, \quad S = B - C^{T}A^{-1}C, \tag{25}$$

we can obtain the update equation for the inverse of the growing matrix in (24) as

$$\Lambda(i) = \frac{1}{r(i)}\begin{bmatrix} \Lambda(i-1)\,r(i) + \theta(i)\theta^{T}(i) & -\theta(i) \\ -\theta^{T}(i) & 1 \end{bmatrix}, \tag{26}$$

where θ(i) = Λ(i − 1)ξ(i) and r(i) = ρ^{i}ζ2σ²/(pν(i)) + ϕ^T(i)ϕ(i) − θ^T(i)ξ(i). Combining (21) with (26), the coefficient vector α(i) of the weight vector Ω(i) is updated as

$$\alpha(i) = \begin{bmatrix} \alpha(i-1) - \theta(i)\,r^{-1}(i)\,e(i) \\ r^{-1}(i)\,e(i) \end{bmatrix}, \tag{27}$$

where e(i) = d(i) − f̂(i) denotes the difference between the desired output d(i) and the system output f̂(i) = Σ_{j=1}^{i−1} α_j(i − 1) κ_{σ1}(u(j), u(i)), in which α_j(i − 1) is the jth element of α(i − 1) and all the previous data are the centers. The coefficients α(i − 1) and all the previous data should be stored at each iteration. Finally, the RMKRP algorithm is summarized in Algorithm 1.
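Putting (20)-(27) together, a minimal Python sketch of the resulting online recursion is given below. It follows our reconstruction above; the class interface, the first-sample initialization, and the numerical floor on ν(i) are illustrative assumptions rather than the paper's Algorithm 1.

```python
import numpy as np

class RMKRP:
    # Minimal sketch of the recursion (20)-(27); initialization details are ours.
    def __init__(self, lam=1.0, p=4.0, sigma=1.0, sigma1=1.0, rho=1.0, zeta=0.1):
        self.lam, self.p, self.sigma, self.sigma1 = lam, p, sigma, sigma1
        self.rho, self.zeta = rho, zeta
        self.centers, self.alpha, self.Lam, self.i = [], None, None, 0

    def _kvec(self, u):
        # xi(i) = Phi^T(i-1) phi(i), computed by the kernel trick
        c = np.asarray(self.centers)
        return np.exp(-np.sum((c - np.atleast_1d(u)) ** 2, axis=1)
                      / (2.0 * self.sigma1 ** 2))

    def predict(self, u):
        return 0.0 if self.alpha is None else float(self._kvec(u) @ self.alpha)

    def update(self, u, d):
        self.i += 1
        e = d - self.predict(u)                       # e(i) = d(i) - f_hat(i)
        z = 1.0 - np.exp(-e ** 2 / (2.0 * self.sigma ** 2))
        nu = np.exp(self.lam * z ** (self.p / 2.0)) \
             * z ** (self.p / 2.0 - 1.0) * (1.0 - z)
        reg = (self.rho ** self.i) * self.zeta * 2.0 * self.sigma ** 2 \
              / (self.p * max(nu, 1e-8))              # rho^i zeta 2 sigma^2 / (p nu(i))
        if self.alpha is None:                        # first sample
            self.centers = [np.atleast_1d(u)]
            self.Lam = np.array([[1.0 / (1.0 + reg)]])
            self.alpha = self.Lam @ np.array([d])
            return e
        xi = self._kvec(u)
        theta = self.Lam @ xi                         # theta(i) = Lam(i-1) xi(i)
        r = reg + 1.0 - theta @ xi                    # phi^T(i) phi(i) = kappa(0) = 1
        top = self.Lam * r + np.outer(theta, theta)   # grow Lam via (25)-(26)
        self.Lam = np.block([[top, -theta[:, None]],
                             [-theta[None, :], np.ones((1, 1))]]) / r
        self.alpha = np.concatenate([self.alpha - theta * e / r, [e / r]])
        self.centers.append(np.atleast_1d(u))
        return e
```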

QRMKRP
The RMKRP algorithm generates a linearly growing network owing to the kernel trick. The online vector quantization (VQ) method [12] has been successfully applied in KAFs to curb network growth efficiently. Thus, we incorporate the online VQ method into the RMKRP to develop the quantized RMKRP (QRMKRP) algorithm, as follows.
Suppose that the dictionary C(i) contains L vectors at discrete time i, i.e., C(i) = {C_k(i)}_{k=1}^{L}, k ∈ Id = {1, 2, ..., L}, which means that there are L distinct quantization regions. In the RKHS, the prediction f̂(i) is therefore expressed as f̂(i) = ϕ^T(C_k(i))Ω̂, where Ω̂ ∈ F is the weight vector in the RKHS F. The cost function of QRMKRP based on C(i), denoted as (28), has the same form as (15) with every past input replaced by the center of the quantization region it falls into, where |D_k| denotes the number of input data that lie in the kth quantization region of C(i), satisfying Σ_{k∈Id}|D_k| = i and |D_k| ≥ 1, and d_kn(i) is the desired output d(i) corresponding to the nth element within the kth quantization region. The offline solution to the minimization of (28) can be described, analogously to (17), by

$$\hat{\Omega}(i) = \left[\hat{\Phi}(i)\hat{H}(i)\hat{\Phi}^{T}(i) + \frac{2\sigma^{2}}{p}\rho^{i}\zeta I\right]^{-1}\hat{\Phi}(i)\,\tilde{d}(i), \tag{29}$$

where Φ̂(i) = [ϕ(C_1(i)), ϕ(C_2(i)), ..., ϕ(C_L(i))]; d̃(i) denotes an accumulated weighted output vector; Ĥ(i) is the corresponding accumulated weighting matrix, whose entries H_kn(i) correspond to the nth entry of the kth quantization region; and I denotes an identity matrix with an appropriate dimension. Since (29) has a form similar to (17), we simplify (29) as

$$\hat{\Omega}(i) = \hat{\Phi}(i)\,Q(i)\,\tilde{d}(i) = \hat{\Phi}(i)\,\hat{\alpha}(i), \tag{30}$$

where K̂(i) = Φ̂^T(i)Φ̂(i) and Q(i) is the corresponding L × L inverse matrix. To obtain the recursive solution to the minimization of (28), we let α̂(i) = Q(i)d̃(i). To update Ω̂(i) in (30) recursively, two cases are therefore considered, as sketched after this list.
(1) Case 1: dis(u(i), C(i − 1)) ≤ ε, where ε is the quantization threshold. In this case, we have C(i) = C(i − 1), and Q(i) keeps the size of Q(i − 1); the input u(i) is therefore quantized to the k*th element of the dictionary C(i − 1), where k* = arg min_{1≤k≤L} ‖u(i) − C_k(i − 1)‖, and only the entries of Ĥ(i) and d̃(i) associated with the k*th region are updated, in a form similar to [13].
(2) Case 2: dis(u(i), C(i − 1)) > ε. In this case, u(i) is added to the dictionary as a new center, i.e., C(i) = C(i − 1) ∪ {u(i)}, and α̂(i) and Q(i) are expanded recursively in the same manner as in the RMKRP updates (26) and (27).

Simulation
In this section, two examples, i.e., Mackey-Glass (MG) chaotic time series prediction and nonlinear system identification, are used to validate the performance superiority of the proposed RMKRP algorithm and its quantized version.
In this example, the noise environment considered is impulsive noise, which is modeled as the combination of two independent noise processes [17], i.e.,

$$v(i) = (1 - b(i))\,v_{1}(i) + b(i)\,v_{2}(i), \tag{39}$$

where v₁(i) is an ordinary noise disturbance with small variance and v₂(i) represents large outliers with large variance; b(i) is a binary random process taking values in {0, 1}, whose probability of taking 1 determines the occurrence rate of outliers. Here, v₂(i) is modeled by an α-stable process owing to its heavy-tailed probability density function. The α-stable process is described by the following characteristic function [29]:

$$\psi(t) = \exp\!\left\{ j\delta t - \gamma |t|^{\alpha}\left[1 + j\beta\,\mathrm{sgn}(t)\,S(t, \alpha)\right] \right\}, \tag{40}$$

with

$$S(t, \alpha) = \begin{cases} \tan\dfrac{\alpha\pi}{2}, & \alpha \neq 1, \\[4pt] \dfrac{2}{\pi}\log|t|, & \alpha = 1, \end{cases}$$

where α ∈ (0, 2] is the characteristic factor, β ∈ [−1, 1] is the symmetry parameter, γ > 0 is the dispersion parameter, sgn(·) denotes the sign function, j = √−1, and −∞ < δ < ∞ is the location parameter. Generally, a smaller α generates a heavier tail, and a smaller γ generates fewer large outliers. The distribution, denoted as V_{α-stable}(α, β, γ, δ), is chosen as V_{α-stable}(0.8, 0, 0.1, 0) to model the impulsive noise in the simulations.
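For reproducibility, impulsive noise of the form (39) can be sampled, for instance, with SciPy's α-stable distribution. The sketch below is an illustrative assumption: the occurrence probability c and the variance of v₁ are not specified in this excerpt, and SciPy's levy_stable parameterization may differ slightly from (40).

```python
import numpy as np
from scipy.stats import levy_stable  # SciPy's alpha-stable distribution

def impulsive_noise(n, c=0.1, small_std=0.01,
                    alpha=0.8, beta=0.0, gamma=0.1, delta=0.0, seed=0):
    # v(i) = (1 - b(i)) v1(i) + b(i) v2(i), with b(i) ~ Bernoulli(c):
    # v1 is small Gaussian noise, v2 is heavy-tailed alpha-stable noise.
    rng = np.random.default_rng(seed)
    b = rng.random(n) < c
    v1 = rng.normal(0.0, small_std, size=n)
    v2 = levy_stable.rvs(alpha, beta, loc=delta, scale=gamma,
                         size=n, random_state=rng)
    return np.where(b, v2, v1)
```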

Chaotic Time Series Prediction
The MG chaotic time series is generated from the following delay differential equation [9]:

$$\frac{dx(t)}{dt} = \frac{\beta\, x(t - \tau)}{1 + x^{n}(t - \tau)} - \gamma\, x(t),$$

where β, γ, n > 0. Here, we set β = 0.2, γ = 0.1, and τ = 30. The time series is discretized at a sampling period of six seconds. The training set includes a segment of 2000 samples corrupted by the additive noise shown in (39), and another 200 noise-free samples are used as the testing set. The kernel size σ1 of the Gaussian kernel is set to 1. The filter length is set to L = 7, which means that [x_t, x_{t−1}, ..., x_{t−6}] is used to predict x_{t+1}. To evaluate the filtering accuracy, the testing mean square error (MSE) is defined as

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d(i) - \hat{f}(i)\right)^{2},$$

where f̂(i) is the estimate of d(i) and N is the length of the testing data. The KLMS [6], KMCC [22], MKRL [26], KRMC [21], and KRLS [8] algorithms are chosen for performance comparison with RMKRP owing to their excellent filtering performance. The sparsification algorithms, i.e., the QKLMS [12], QKMCC [30], QMKRL [26], QKRLS [13], and KRMC-NC [21] algorithms, are used for performance comparison with QRMKRP owing to their modest space complexity and excellent performance. All simulation results are averaged over 50 independent Monte Carlo runs.
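A minimal way to generate the training data is to integrate the MG equation with a simple Euler scheme and to score predictions with the testing MSE defined above. The exponent n = 10 and the unit inner step of the integrator are our own illustrative choices; the paper specifies only β, γ, and τ.

```python
import numpy as np

def mackey_glass(n_samples, beta=0.2, gamma=0.1, n_exp=10, tau=30,
                 sample_period=6, x0=1.2):
    # Euler integration (unit inner step) of
    # dx/dt = beta x(t - tau) / (1 + x(t - tau)^n) - gamma x(t),
    # sampled every `sample_period` seconds.
    hist = np.full(tau + 1, x0)        # sliding window of the last tau+1 values
    out = []
    for _ in range(n_samples):
        for _ in range(sample_period):
            x_tau, x_now = hist[0], hist[-1]
            x_new = x_now + beta * x_tau / (1.0 + x_tau ** n_exp) - gamma * x_now
            hist = np.append(hist[1:], x_new)
        out.append(hist[-1])
    return np.array(out)

def testing_mse(d, f_hat):
    # Testing MSE = (1/N) sum_i (d(i) - f_hat(i))^2 over the noise-free test set
    d, f_hat = np.asarray(d), np.asarray(f_hat)
    return float(np.mean((d - f_hat) ** 2))
```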
Since the power parameter p, the risk-sensitive parameter λ, and the kernel width σ are crucial parameters of the proposed RMKRP and QRMKRP algorithms, their influence on performance is discussed first. In the simulations, we take 12 points evenly in the closed intervals p ∈ [1, 6] and σ ∈ [0.17, 5], respectively. The influence of p on the steady-state performance of RMKRP is shown in Figure 2a, where the steady-state MSEs are obtained as averages over the last 100 iterations. The parameters are set as follows: p is set within [1, 6]; the risk-sensitive parameter λ in the KRP is set to 1; ζ = 0.1 and ρ = 1; the kernel size σ in the KRP is set to 1. As can be seen from Figure 2a, the filtering accuracy of RMKRP is highest when p = 4 and decreases gradually when p is either too small or too large. Then, the influence of σ on the filtering performance of RMKRP with p = 4 is shown in Figure 2b, where the steady-state MSEs are again averages over the last 100 iterations; λ is fixed at 1 and σ lies in [0.17, 5]. From Figure 2b, we see that RMKRP achieves the highest filtering accuracy when σ is about 1. This is reasonable: RMKRP is sensitive to outliers when the kernel width is too large, and loses its error-correction ability when the kernel width is too small. Finally, the influence of λ on the filtering performance of RMKRP with σ = 1 and p = 4 is shown in Figure 2c, where λ ∈ {0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 2, 3, 4}. From Figure 2c, we see that λ has only a slight influence on the filtering accuracy when λ is small, whereas a large λ increases the steady-state MSE noticeably. Therefore, from Figure 2, the parameters of RMKRP can be chosen by trial in practice to obtain the best performance. The parameters of QRMKRP can be chosen in the same way.

The performance comparison of QKLMS, QKMCC, QMKRL, KLMS, KMCC, MKRL, KRLS, KRMC, KRMC-NC, and QKRLS is conducted in the same noise environment as in (39). The parameters of the proposed algorithms are selected by trial to achieve desirable performance, and the parameters of the compared algorithms are chosen such that all algorithms have almost the same convergence rate: λ = 1, p = 4, and σ = 1 for RMKRP; λ = 1, p = 4, σ = 1, and ε = 0.2 for QRMKRP; η = 0.1 for KLMS; η = 0.09 and σ = 3.5 for KMCC; η = 0.09, σ = 1, and λ = 1 for MKRL; η = 0.1 and ε = 0.2 for QKLMS; η = 0.09, ε = 0.2, and σ = 3.5 for QKMCC; η = 0.09, ε = 0.2, σ = 1, and λ = 1 for QMKRL; ζ = 0.1, ρ = 1, and σ = 3.5 for KRMC; the novelty criterion thresholds δ1 = 0.15 and δ2 = 0.1 with ζ = 0.1, ρ = 1, and σ = 3.5 for KRMC-NC; ζ = 0.1 for KRLS; and ζ = 0.1 and ε = 0.2 for QKRLS. Here, ε denotes the quantization threshold. Figure 3 shows the testing MSEs of RMKRP, QRMKRP, and the compared algorithms. As can be seen from Figure 3, RMKRP achieves better filtering accuracy than KRLS, KRMC, KLMS, KMCC, and MKRL, and QRMKRP achieves a better steady-state testing MSE than the sparsification algorithms QKRLS, KRMC-NC, QKLMS, QKMCC, and QMKRL. We also see from Figure 3 that the proposed algorithms are robust to impulsive noise. For a detailed comparison, the dictionary size, consumed time, and steady-state MSEs of Figure 3 are listed in Table 1. Note that the steady-state MSEs of KLMS, QKLMS, KRLS, and QKRLS are not shown in Table 1 since they cannot converge in such an impulsive noise environment.
From Table 1, we see that RMKRP has a consumed time similar to that of KRLS and KRMC but provides better filtering accuracy. In addition, QRMKRP provides the highest filtering accuracy among all the compared sparsification algorithms and approaches the filtering accuracy of RMKRP with a significantly smaller network size.
Nonlinear System Identification

The training set includes a segment of 2000 samples corrupted by the additive noise shown in (39), and another 200 noise-free samples are used as the testing set. The kernel width σ1 of the Gaussian kernel is set to 1. All simulation results are averaged over 50 independent Monte Carlo runs.
Similar to the MG chaotic time series prediction, the influence of the power parameter p, the risk-sensitive parameter λ, and the kernel width σ on the performance of RMKRP is also examined for nonlinear system identification. The influence of p on the steady-state performance of RMKRP is shown in Figure 4a, where the steady-state MSEs are obtained as averages over the last 100 iterations; p is set within [1, 6], λ is set to 0.1, ζ = 0.1 and ρ = 1, and the kernel size σ in the KRP is set to 1. The influence of σ on the filtering performance of RMKRP is shown in Figure 4b, where λ is fixed at 0.1, σ lies in [0.17, 5], and p is set to 4. The influence of λ on the filtering performance of RMKRP is shown in Figure 4c, where λ ∈ {0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 2, 3, 4}, σ is set to 1, and p is set to 4. As can be seen from Figure 4, the same conclusions as those in Figure 2 can be drawn.

We compare the filtering performance of QKLMS, QKMCC, QMKRL, KLMS, KMCC, MKRL, KRLS, KRMC, KRMC-NC, and QKRLS in the same noise environment as in (39). The parameters of the proposed algorithms are selected by trial to achieve desirable performance, and the parameters of the compared algorithms are chosen such that all algorithms have almost the same convergence rate: λ = 0.1, p = 4, and σ = 1 for RMKRP; λ = 0.1, p = 4, σ = 1, and ε = 0.2 for QRMKRP; η = 0.1 for KLMS; η = 0.09 and σ = 3.5 for KMCC; η = 0.09, σ = 1, and λ = 2 for MKRL; η = 0.1 and ε = 0.2 for QKLMS; η = 0.09, ε = 0.2, and σ = 3.5 for QKMCC; η = 0.09, ε = 0.2, σ = 1, and λ = 2 for QMKRL; ζ = 0.1, ρ = 1, and σ = 3.5 for KRMC; the novelty criterion thresholds δ1 = 0.01 and δ2 = 0.1 with ζ = 0.1, ρ = 1, and σ = 3.5 for KRMC-NC; ζ = 0.1 for KRLS; and ζ = 0.1 and ε = 0.2 for QKRLS. Figure 5 shows the testing MSEs of RMKRP, QRMKRP, and the compared algorithms. For a detailed comparison, the dictionary size, consumed time, and steady-state MSEs of Figure 5 are also listed in Table 2, where the steady-state MSEs of KLMS, QKLMS, KRLS, and QKRLS are not shown since they cannot converge in such impulsive noise environments. From Figure 5 and Table 2, the same conclusions as those in Figure 3 and Table 1 can be drawn.

Conclusions
In this paper, the kernel risk-sensitive mean p-power error (KRP) criterion has been proposed by incorporating the mean p-power error (MPE) into the kernel risk-sensitive loss (KRL) in the RKHS, and some of its basic properties have been presented. Owing to the power parameter p, the KRP criterion is more flexible than the KRL in handling signals corrupted by impulsive noise. Two kernel recursive adaptive algorithms have been derived to obtain desirable filtering accuracy under the minimum KRP (MKRP) criterion, i.e., the recursive minimum KRP (RMKRP) algorithm and its quantized version (QRMKRP). RMKRP achieves higher accuracy with almost the same computational complexity as KRLS and KRMC. The vector quantization method introduced into RMKRP yields QRMKRP, which effectively reduces the network size while maintaining the filtering accuracy. Simulations on Mackey-Glass (MG) chaotic time series prediction and nonlinear system identification under impulsive noise illustrate the superiority of RMKRP and QRMKRP in terms of robustness and filtering accuracy.