1. Introduction
Many real-world applications, such as system identification, regression, and online kernel learning (OKL) [1], require complex nonlinear models. The kernel method using a Mercer kernel has attracted interest for tackling these complex nonlinear applications, since it transforms nonlinear problems into linear ones in the reproducing kernel Hilbert space (RKHS) [2]. Developed in the RKHS, the kernel adaptive filter (KAF) [2] is the most celebrated subfield of OKL algorithms. Using the simple stochastic gradient descent (SGD) method for learning, KAFs including the kernel least mean square (KLMS) algorithm [3], the kernel affine projection algorithm (KAPA) [4], and the kernel recursive least squares (KRLS) algorithm [5] have been proposed.
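For concreteness, the KLMS update of [3] takes the familiar SGD form in the RKHS (a textbook statement, with step size $\eta$, Mercer kernel $\kappa$, desired output $d(i)$, and input $\mathbf{u}(i)$):
$$f_i = f_{i-1} + \eta\, e(i)\, \kappa(\mathbf{u}(i), \cdot), \qquad e(i) = d(i) - f_{i-1}(\mathbf{u}(i)).$$
Each iteration adds one kernel unit centered at the new input $\mathbf{u}(i)$, which is the source of the growing dictionary discussed next.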
However, since a new kernel unit is allocated as a radial basis function (RBF) center whenever new data arrive, the linearly growing structure (called the “dictionary” hereafter) increases the computational and memory requirements of KAFs. To curb the growth of the dictionary, two categories of sparsification are used. The first category accepts only informative data as new dictionary centers by using a threshold; it includes the surprise criterion (SC) [6], the coherence criterion (CC) [7], and vector quantization (VQ) [8]. However, these methods cannot fully address the growing problem and still introduce additional time consumption at each iteration. The second category fixes the network size in advance, including the fixed-budget (FB) method [9], the sliding-window (SW) method [10], and kernel approximation methods (e.g., the Nyström method [11] and the random Fourier features (RFF) method [12]), and is used to overcome the linearly growing problem. However, the FB and SW methods cannot guarantee good performance in certain environments within a limited amount of time [13]. In contrast to the Nyström method, RFFs are drawn from a distribution that is independent of the training data. Owing to this data-independent vector representation, RFFs provide a good solution in non-stationary circumstances. On the basis of RFFs, random Fourier mapping (RFM) maps the input data into a finite-dimensional random Fourier features space (RFFS) with a fixed network structure, using randomized features derived from the kernel’s Fourier transform. The RFM alleviates the computational and storage burdens of KAFs and ensures satisfactory performance under non-stationary conditions. Examples of KAFs developed with the RFM are the random Fourier features kernel least mean square (RFFKLMS) algorithm [13], the random Fourier features maximum correntropy (RFFMC) algorithm [14], and the random Fourier features conjugate gradient (RFFCG) algorithm [15].
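To illustrate the RFM construction, the sketch below follows the classical random Fourier features recipe for a Gaussian kernel; the function name, dimension $m$, and bandwidth $\sigma$ are illustrative and not the notation of [13,14,15].

```python
import numpy as np

def make_rfm(input_dim, m, sigma, seed=0):
    """Build a random Fourier mapping z(u) that approximates the Gaussian
    kernel k(u, u') = exp(-||u - u'||^2 / (2 sigma^2)), in the sense that
    z(u).T @ z(u') ~= k(u, u'), with a fixed, data-independent structure."""
    rng = np.random.default_rng(seed)
    # Frequencies sampled from the Fourier transform of the Gaussian kernel
    omega = rng.normal(0.0, 1.0 / sigma, size=(m, input_dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=m)

    def z(u):
        # Fixed m-dimensional feature vector, regardless of how much data arrives
        return np.sqrt(2.0 / m) * np.cos(omega @ u + phase)

    return z
```

Because the frequencies and phases are drawn once and never depend on the data, the filter weight vector in the RFFS keeps a fixed dimension $m$, which is exactly what curbs the dictionary growth.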
For the loss function, second-order statistical measures (e.g., the minimum mean square error (MMSE) [2] and least squares [16]) are widely utilized in KAFs due to their simplicity, smoothness, and mathematical tractability. However, KAFs based on second-order statistical measures are sensitive to non-Gaussian noises, including sub-Gaussian and super-Gaussian noises, which means that their performance may degrade seriously if the training data are contaminated by outliers. To handle this issue, robust statistical measures have gained more attention, among which the lower-order error measure [17] and the higher-order error measure [18] are two typical examples. However, the higher-order error measure is not suitable for mixtures of Gaussian and super-Gaussian noises (Laplace, α-stable, etc.) owing to its poor stability and convergence, while the lower-order error measure, although usually more desirable in these noise environments, suffers from a slow convergence rate. Recently, similarity measures from information theoretic learning (ITL) [19], such as the maximum correntropy criterion (MCC) [20] and the minimum error entropy (MEE) criterion [19], have been introduced to implement robust KAFs. The ITL similarity measures have been shown to provide strong robustness against non-Gaussian noises at the expense of an increased computational burden during training. In addition, by minimizing the logarithmic moments of the error, the logarithmic error measure—including the Cauchy loss (CL) [21] with its low computational complexity—is an appropriate measure of optimality. Using the Cauchy loss to penalize the noise term, algorithms based on the minimum Cauchy loss (MCL) criterion are efficient at combating non-Gaussian noises, especially heavy-tailed α-stable noises.
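For reference, a common form of the Cauchy loss used by MCL-type algorithms is (with scale parameter $\lambda > 0$; the notation may differ slightly from [21]):
$$J_{\mathrm{CL}}(e) = \frac{\lambda^{2}}{2}\,\ln\!\left(1 + \frac{e^{2}}{\lambda^{2}}\right), \qquad \frac{\partial J_{\mathrm{CL}}}{\partial e} = \frac{\lambda^{2} e}{\lambda^{2} + e^{2}},$$
so the loss grows only logarithmically in $|e|$ and the influence of any single error is bounded by $\lambda/2$, which is what makes large impulsive outliers essentially harmless.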
From the aspect of the optimization method, stochastic gradient descent (SGD)-based algorithms cannot find the minimum along the negative gradient for some loss functions [20,21,22]. Recursive algorithms [23] address this issue at the cost of an increased computational burden. In comparison with the SGD and recursive methods, the conjugate gradient (CG) method [24,25,26] and Newton's method, as developments of SGD, have become alternative optimization methods for KAFs. The matrix inversion required by Newton's method increases the computation and causes divergence in some cases [22]. The CG method, however, offers a trade-off between convergence rate and computational complexity without any inverse computation, and has been successfully applied in various fields, including compressed sensing [27], neural networks [28], and large-scale optimization [29]. In addition, the kernel conjugate gradient (KCG) method has been proposed for adaptive filtering [30]. With low computational and space requirements, KCG can produce a better solution than KLMS and has accuracy comparable to KRLS.
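As a reminder of why CG avoids explicit inversion, a minimal textbook CG loop for a symmetric positive-definite system $\mathbf{R}\mathbf{w} = \mathbf{p}$ is sketched below (generic CG, not the online KCG recursion of [30]):

```python
import numpy as np

def conjugate_gradient(R, p, tol=1e-10, max_iter=None):
    """Solve R w = p for symmetric positive-definite R using only
    matrix-vector products; no inverse of R is ever formed."""
    n = len(p)
    max_iter = max_iter or n
    w = np.zeros(n)
    r = p - R @ w              # residual
    d = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        Rd = R @ d
        alpha = rs / (d @ Rd)  # exact line search along d
        w += alpha * d
        r -= alpha * Rd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d  # next R-conjugate direction
        rs = rs_new
    return w
```

Each iteration costs one matrix-vector product, which is what gives CG its favorable balance between convergence rate and complexity.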
In this paper, to reduce the computational complexity, we apply the RFM in an MCL-based KAF to address the problem of linear growth and to improve robustness. Further, the CG optimization method is used to improve the filtering accuracy and convergence rate, yielding a novel robust random Fourier features Cauchy conjugate gradient (RFFCCG) algorithm. The contributions of this paper are summarized as follows. (1) Inspired by the finite-dimensional RFM and the MCL criterion, the novel RFFCCG algorithm is derived by mapping the original input data into a fixed-dimensional RFFS, which removes the growth of the network structure and improves robustness over other robust algorithms in the presence of non-Gaussian noises. (2) By applying the CG method, RFFCCG provides good filtering accuracy against non-Gaussian noises with low computational and space complexities; both complexities are also discussed. (3) The proposed algorithm also achieves excellent tracking performance when the system undergoes a sudden change.
The rest of this paper is structured as follows. The MCL criterion and its convexity are described in Section 2, and the online CG algorithm is also briefly reviewed in this section. In Section 3, we present the proposed RFFCCG algorithm and its complexity analysis. Illustrative simulations in the presence of non-Gaussian noises are presented in Section 4 to confirm the effectiveness of the proposed algorithm. Finally, Section 5 gives the concluding remarks of this paper.
4. Simulation
To demonstrate the superior performance of the proposed RFFCCG algorithm, simulations were performed on Mackey–Glass chaotic time series prediction and on nonlinear system identification. Owing to their modest complexity and excellent performance, representative algorithms (i.e., the random Fourier features kernel least mean square (RFFKLMS) algorithm [13], the quantized kernel recursive least squares (QKRLS) algorithm [32], the random Fourier features maximum correntropy (RFFMC) algorithm [14], the kernel recursive maximum correntropy algorithm with novelty criterion (KRMC-NC) [31], and the random Fourier features conjugate gradient (RFFCG) algorithm [15]) were selected for comparison with RFFCCG. Among these algorithms, RFFMC and KRMC-NC are typical robust algorithms, while the non-robust RFFKLMS, QKRLS, and RFFCG are used as filtering performance references. For all simulations, we ran 50 independent Monte Carlo trials to reduce disturbances, using Matlab R2016b on Windows 10 on a PC with a 3.30 GHz CPU and 8 GB of RAM.
To evaluate the filtering performance of the algorithms, the testing mean-square error (MSE) is defined as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d(i) - \hat{d}(i)\right)^{2},$$
where $\hat{d}(i)$ is the prediction of the desired output $d(i)$, and $N$ is the length of the testing data.
The non-Gaussian noise model considered in this section is the impulsive noise [33], modeled as the combination of two mutually independent noise processes. We assumed the mixture noise model in the form of $v(i) = (1 - b(i))\,v_1(i) + b(i)\,v_2(i)$, where $b(i)$ is a binary distribution with occurrence probabilities $\Pr\{b(i) = 1\} = c$ and $\Pr\{b(i) = 0\} = 1 - c$. Unless mentioned otherwise, the parameter $c$ was set to 0.1, and $v_1(i)$ is a zero-mean Gaussian distribution with fixed variance $\sigma_1^2$. For $v_2(i)$, we mainly considered the $\alpha$-stable noise (heavy-tailed impulsive noise) process with characteristic function [34]
$$\psi(t) = \exp\left\{ j\delta t - \gamma |t|^{\alpha} \left[ 1 + j\beta\,\operatorname{sgn}(t)\,S(t,\alpha) \right] \right\},$$
where
$$S(t,\alpha) = \begin{cases} \tan\dfrac{\alpha\pi}{2}, & \alpha \neq 1, \\[4pt] \dfrac{2}{\pi}\log|t|, & \alpha = 1, \end{cases}$$
$\alpha \in (0, 2]$ is the characteristic factor measuring the heaviness of the tail (a smaller $\alpha$ means a stronger impulse), $\gamma > 0$ is the dispersion factor that controls the number of impulses, $\delta$ is the location factor, $\beta \in [-1, 1]$ is the symmetry factor, and $\operatorname{sgn}(\cdot)$ is the sign function. The parameter vector of the noise model is written as $V(\alpha, \beta, \gamma, \delta)$, and a fixed setting of this vector was used in the following simulations.
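A minimal sketch of this mixture noise is given below, assuming a symmetric $\alpha$-stable component ($\beta = 0$, $\delta = 0$) generated with the Chambers–Mallows–Stuck method; all numeric parameter values are illustrative placeholders.

```python
import numpy as np

def symmetric_alpha_stable(alpha, gamma, size, rng):
    """Draw symmetric alpha-stable samples (beta = 0, delta = 0) via the
    Chambers-Mallows-Stuck method; gamma is the dispersion factor."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform phase
    W = rng.exponential(1.0, size)                 # unit exponential
    X = (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
         * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))
    return gamma ** (1 / alpha) * X

def mixture_impulsive_noise(n, c=0.1, sigma1=0.1, alpha=1.5, gamma=0.1, seed=0):
    """v(i) = (1 - b(i)) v1(i) + b(i) v2(i): a Gaussian background v1
    hit by alpha-stable impulses v2 with occurrence probability c."""
    rng = np.random.default_rng(seed)
    b = rng.binomial(1, c, n)                      # Bernoulli switch
    v1 = rng.normal(0.0, sigma1, n)                # zero-mean Gaussian
    v2 = symmetric_alpha_stable(alpha, gamma, n, rng)
    return (1 - b) * v1 + b * v2
```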
4.1. Mackey–Glass Time Series
Since the Mackey–Glass (MG) chaotic system is a benchmark for nonlinear learning problems, we first considered the MG chaotic time series [2], generated by the delayed differential equation
$$\frac{dx(t)}{dt} = -a\,x(t) + \frac{b\,x(t-\tau)}{1 + x^{n}(t-\tau)}$$
with the standard benchmark values $a = 0.1$, $b = 0.2$, $\tau = 30$, and $n = 10$. The time series was discretized at a sampling period of 6 s and corrupted by the noise model described above. We used the previous seven points $[x(i-7), x(i-6), \ldots, x(i-1)]^{T}$ to predict the current value $x(i)$. The prediction was trained on 2000 data points and tested on another 200.
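A sketch of how such a series can be generated and framed for prediction is given below, using forward Euler integration with the benchmark values above; the integration step size, function names, and initial condition are illustrative.

```python
import numpy as np

def mackey_glass(n_samples, a=0.1, b=0.2, tau=30.0, n_pow=10,
                 dt=0.1, sample_every=60, x0=1.2):
    """Integrate dx/dt = -a x(t) + b x(t - tau) / (1 + x(t - tau)^n_pow)
    with forward Euler; keep every `sample_every`-th point so the
    sampling period is dt * sample_every = 6 s."""
    delay = int(tau / dt)
    steps = n_samples * sample_every
    x = np.empty(delay + steps + 1)
    x[:delay + 1] = x0                      # constant history on [-tau, 0]
    for i in range(delay, delay + steps):
        x_tau = x[i - delay]
        x[i + 1] = x[i] + dt * (-a * x[i] + b * x_tau / (1 + x_tau ** n_pow))
    return x[delay + 1::sample_every][:n_samples]

def embed(series, order=7):
    """Stack the previous `order` points as inputs to predict the next one."""
    U = np.array([series[i:i + order] for i in range(len(series) - order)])
    d = series[order:]
    return U, d
```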
The Cauchy loss scale parameter $\lambda$ is key in the proposed RFFCCG algorithm. In the first simulation, we discuss the influence of $\lambda$ on the filtering accuracy of RFFCCG in combating non-Gaussian noises, with $\lambda$ selected over a wide range. The influence of $\lambda$ is shown in Figure 2, where the steady-state MSEs were derived by averaging over the last 200 iterations. For RFFCCG, the Gaussian kernel bandwidth, the forgetting factor, and the dimension of the RFFS were fixed. It can be seen from Figure 2 that $\lambda$ has a direct influence on the filtering performance of RFFCCG: the algorithm achieved the highest filtering accuracy at an intermediate value of $\lambda$, so too large or too small a $\lambda$ causes performance degradation, while an appropriate $\lambda$ combats impulsive noises efficiently. We therefore used this best-performing $\lambda$ for RFFCCG in the following simulations.
In addition, the steady-state MSEs and the average consumed time for different dimensions $m$ are plotted in Figure 3. Here, the simulation environment and kernel bandwidth setting of RFFCCG were the same as those of Figure 2, and the range of $m$ was set as $[1, 100]$. From Figure 3, we observe that: (1) the average consumed time increased linearly with $m$; and (2) the filtering accuracy of RFFCCG improved with increasing $m$ up to a point, beyond which it remained almost unchanged. In other words, a larger dimension $m$ yields higher filtering accuracy at the expense of increased computational time. The dimension $m$ of the RFFS was therefore fixed for RFFCCG to provide a trade-off between filtering accuracy and computational time.
In this example, we compared the filtering accuracy and robustness of RFFCCG with those of the other filtering algorithms. The parameters of each algorithm were set to achieve the desired filtering accuracy and the same convergence rate: the bandwidth of the Gaussian kernels was set to 1 for all algorithms; a common step size was used for RFFKLMS and RFFMC; a quantization threshold was chosen for QKRLS; the distance threshold, the error threshold, and the regularization parameter were tuned for KRMC-NC; the same forgetting factor was set for RFFCG and RFFCCG; the best-performing $\lambda$ was chosen for RFFCCG; and the dimension of the RFFS was fixed as above. From Figure 4, we observe that the performance of the quadratic-loss algorithms (i.e., RFFKLMS, QKRLS, and RFFCG) deteriorated in the non-Gaussian noise environment, while RFFMC, KRMC-NC, and RFFCCG always generated stable performance and achieved desirable performance when impulsive noise appeared. In particular, the filtering performance of RFFCCG was very close to that of the recursive KRMC-NC algorithm and better than that of the SGD-based RFFMC algorithm.
Table 3 lists the detailed simulation results in terms of dictionary size, steady-state MSE, and average consumed time. One can also observe that RFFCCG produced filtering accuracy comparable to KRMC-NC with less consumed time and lower storage requirements. Thus, RFFCCG is the most efficient of the compared algorithms for MG time series prediction.
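To make the simulation pipeline concrete, the sketch below chains the earlier pieces (RFM features, MG data, mixture noise) around a Cauchy-loss SGD update. This is an RFFKLMS-style baseline under the MCL criterion, not the proposed CG recursion, and every numeric value is illustrative.

```python
import numpy as np

# Reuses make_rfm, mackey_glass, embed, and mixture_impulsive_noise from
# the earlier sketches (all illustrative).
series = mackey_glass(2200)
U, d = embed(series, order=7)                  # 7-tap inputs, as in the text
d_noisy = d[:2000] + mixture_impulsive_noise(2000, c=0.1)

z = make_rfm(input_dim=7, m=100, sigma=1.0)    # fixed-dimensional RFFS
w = np.zeros(100)
eta, lam = 0.5, 1.0                            # step size and Cauchy scale

for u_i, d_i in zip(U[:2000], d_noisy):
    phi = z(u_i)
    e = d_i - w @ phi
    # Bounded-influence gradient of the Cauchy loss: lam^2 e / (lam^2 + e^2)
    w += eta * (lam ** 2 * e / (lam ** 2 + e ** 2)) * phi

test_pred = np.array([w @ z(u) for u in U[2000:2200]])
testing_mse = np.mean((d[2000:2200] - test_pred) ** 2)
```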
4.2. Nonlinear System Identification
To further validate the superiority of RFFCCG, we considered a nonlinear system identification problem, where the nonlinear system takes the form given in [35]: the output $y(k)$ at discrete time $k$ depends nonlinearly on the two previous outputs through a coefficient vector, with fixed initial values. The prediction task is set up as follows: the previous two values $[y(k-1), y(k-2)]^{T}$ were used as the input vector to predict the current value $y(k)$. We considered stationary and non-stationary scenarios in the following simulations. The data were corrupted by the noise model described above, and the same Gaussian kernel parameter was used for all tested algorithms.
In the stationary case, the coefficient vector was fixed. The first 2000 data points were used for training and an additional 200 for testing. We compared the testing MSE of RFFCCG with those of RFFKLMS, QKRLS, RFFMC, KRMC-NC, and RFFCG owing to their modest complexities and excellent performance for a stationary system. The parameters of each algorithm (step sizes, thresholds, regularization parameter, forgetting factor, and $\lambda$) were chosen to obtain the best results, and the dimension of the RFFS was again configured to balance accuracy and computational time. The learning curves of all the algorithms are shown in Figure 5. In this case, the RFFCCG algorithm retained satisfactory prediction ability, achieving performance comparable to KRMC-NC and better than RFFMC, while the other algorithms performed poorly. This again shows that RFFCCG is strongly robust against impulsive noises.
Table 4 shows the dictionary size, steady-state MSEs, and average consumed time of all the algorithms. As can be clearly seen from Figure 5 and Table 4, RFFCCG consumed less time and achieved a faster convergence rate and higher filtering accuracy than the compared algorithms, including RFFCG.
The tracking performance was evaluated in a non-stationary system where two different coefficient vectors were used for data generation: one coefficient vector was used for the first 2000 data points, and another for the following 2000. We compared the testing MSE of RFFCCG with those of RFFKLMS, RFFMC, and RFFCG, owing to their modest complexities and excellent performance in a non-stationary system. To compute the convergence curve, a total of 4000 data points were used for training, with a sudden change at the 2001-st data point; for testing, 400 data points were used, with a sudden change at the 201-st data point. With the same criterion for parameter setting, the step sizes of RFFKLMS and RFFMC were chosen as 0.1 and 0.3, respectively, the same forgetting factor was used in RFFCG and RFFCCG, and the dimension of the RFFS was fixed as before. The performance comparison is presented in Figure 6. It can be observed that all the RFF-based algorithms were capable of tracking the change of the system; however, RFFCCG outperformed all the compared algorithms when the abrupt change occurred. The dictionary size, steady-state MSEs, and consumed time of the tested algorithms, averaged over the final iterations of each stage, are summarized in Table 5. As observed in Figure 6 and Table 5, RFFCCG provides good tracking performance for a non-stationary system in non-Gaussian noises.
Therefore, in both stationary and non-stationary circumstances of nonlinear system identification, the proposed RFFCCG algorithm offers excellent performance in terms of filtering accuracy, convergence rate, robustness, and computational and space complexities.