1. Introduction
Many real-world applications, such as system identification, regression, and online kernel learning (OKL) [1], require complex nonlinear models. The kernel method using a Mercer kernel has attracted interest for tackling these complex nonlinear applications, since it transforms nonlinear problems into linear ones in the reproducing kernel Hilbert space (RKHS) [2]. Developed in the RKHS, the kernel adaptive filter (KAF) [2] is the most celebrated subfield of OKL algorithms. Using the simple stochastic gradient descent (SGD) method for learning, KAFs including the kernel least mean square (KLMS) algorithm [3], the kernel affine projection algorithm (KAPA) [4], and the kernel recursive least squares (KRLS) algorithm [5] have been proposed.
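For concreteness, the KLMS update of [3] takes the familiar SGD form in the RKHS (a textbook statement, with step size $\eta$, Mercer kernel $\kappa$, desired output $d(i)$, and input $\mathbf{u}(i)$):
$$f_i = f_{i-1} + \eta\, e(i)\, \kappa(\mathbf{u}(i), \cdot), \qquad e(i) = d(i) - f_{i-1}(\mathbf{u}(i)).$$
Each iteration adds one kernel unit centered at the new input $\mathbf{u}(i)$, which is the source of the growing dictionary discussed next.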
However, since a new kernel unit is allocated as a radial basis function (RBF) center whenever new data arrive, the linearly growing structure (called the “dictionary” hereafter) increases the computational and memory requirements of KAFs. To curb the growth of the dictionary, two categories of sparsification are used. The first category accepts only informative data as new dictionary centers by using a threshold; it includes the surprise criterion (SC) [6], the coherence criterion (CC) [7], and vector quantization (VQ) [8]. However, these methods cannot fully address the growing problem and still introduce additional time consumption at each iteration. The second category fixes the network size in advance, including the fixed-budget (FB) method [9], the sliding-window (SW) method [10], and kernel approximation methods (e.g., the Nyström method [11] and the random Fourier features (RFF) method [12]), and is used to overcome the linearly growing problem. However, the FB and SW methods cannot guarantee good performance in certain environments within a limited amount of time [13]. In contrast to the Nyström method, RFFs are drawn from a distribution that is independent of the training data. Owing to this data-independent vector representation, RFFs provide a good solution in non-stationary circumstances. On the basis of RFFs, random Fourier mapping (RFM) maps the input data into a finite-dimensional random Fourier features space (RFFS) with a fixed network structure, using randomized features derived from the kernel’s Fourier transform. The RFM alleviates the computational and storage burdens of KAFs and ensures satisfactory performance under non-stationary conditions. Examples of KAFs developed with the RFM are the random Fourier features kernel least mean square (RFFKLMS) algorithm [13], the random Fourier features maximum correntropy (RFFMC) algorithm [14], and the random Fourier features conjugate gradient (RFFCG) algorithm [15].
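To illustrate the RFM construction, the sketch below follows the classical random Fourier features recipe for a Gaussian kernel; the function name, dimension $m$, and bandwidth $\sigma$ are illustrative and not the notation of [13,14,15].

```python
import numpy as np

def make_rfm(input_dim, m, sigma, seed=0):
    """Build a random Fourier mapping z(u) that approximates the Gaussian
    kernel k(u, u') = exp(-||u - u'||^2 / (2 sigma^2)), in the sense that
    z(u).T @ z(u') ~= k(u, u'), with a fixed, data-independent structure."""
    rng = np.random.default_rng(seed)
    # Frequencies sampled from the Fourier transform of the Gaussian kernel
    omega = rng.normal(0.0, 1.0 / sigma, size=(m, input_dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=m)

    def z(u):
        # Fixed m-dimensional feature vector, regardless of how much data arrives
        return np.sqrt(2.0 / m) * np.cos(omega @ u + phase)

    return z
```

Because the frequencies and phases are drawn once and never depend on the data, the filter weight vector in the RFFS keeps a fixed dimension $m$, which is exactly what curbs the dictionary growth.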
For the loss function, second-order statistical measures (e.g., the minimum mean square error (MMSE) [2] and least squares [16]) are widely utilized in KAFs due to their simplicity, smoothness, and mathematical tractability. However, KAFs based on second-order statistical measures are sensitive to non-Gaussian noises, including sub-Gaussian and super-Gaussian noises, which means that their performance may degrade seriously if the training data are contaminated by outliers. To handle this issue, robust statistical measures have gained more attention, among which the lower-order error measure [17] and the higher-order error measure [18] are two typical examples. However, the higher-order error measure is not suitable for mixtures of Gaussian and super-Gaussian noises (Laplace, α-stable, etc.) owing to its poor stability and convergence, while the lower-order error measure, although usually more desirable in these noise environments, suffers from a slow convergence rate. Recently, similarity measures from information theoretic learning (ITL) [19], such as the maximum correntropy criterion (MCC) [20] and the minimum error entropy (MEE) criterion [19], have been introduced to implement robust KAFs. The ITL similarity measures have been shown to provide strong robustness against non-Gaussian noises at the expense of an increased computational burden during training. In addition, by minimizing the logarithmic moments of the error, the logarithmic error measure—including the Cauchy loss (CL) [21] with its low computational complexity—is an appropriate measure of optimality. Using the Cauchy loss to penalize the noise term, algorithms based on the minimum Cauchy loss (MCL) criterion are efficient at combating non-Gaussian noises, especially heavy-tailed α-stable noises.
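For reference, a common form of the Cauchy loss used by MCL-type algorithms is (with scale parameter $\lambda > 0$; the notation may differ slightly from [21]):
$$J_{\mathrm{CL}}(e) = \frac{\lambda^{2}}{2}\,\ln\!\left(1 + \frac{e^{2}}{\lambda^{2}}\right), \qquad \frac{\partial J_{\mathrm{CL}}}{\partial e} = \frac{\lambda^{2} e}{\lambda^{2} + e^{2}},$$
so the loss grows only logarithmically in $|e|$ and the influence of any single error is bounded by $\lambda/2$, which is what makes large impulsive outliers essentially harmless.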
From the aspect of the optimization method, stochastic gradient descent (SGD)-based algorithms cannot find the minimum along the negative gradient for some loss functions [20,21,22]. Recursive algorithms [23] address this issue at the cost of an increased computational burden. In comparison with the SGD and recursive methods, the conjugate gradient (CG) method [24,25,26] and Newton's method, as developments of SGD, have become alternative optimization methods for KAFs. The matrix inversion required by Newton's method increases the computation and causes divergence in some cases [22]. The CG method, however, offers a trade-off between convergence rate and computational complexity without any inverse computation, and has been successfully applied in various fields, including compressed sensing [27], neural networks [28], and large-scale optimization [29]. In addition, the kernel conjugate gradient (KCG) method has been proposed for adaptive filtering [30]. With low computational and space requirements, KCG can produce a better solution than KLMS and has accuracy comparable to KRLS.
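As a reminder of why CG avoids explicit inversion, a minimal textbook CG loop for a symmetric positive-definite system $\mathbf{R}\mathbf{w} = \mathbf{p}$ is sketched below (generic CG, not the online KCG recursion of [30]):

```python
import numpy as np

def conjugate_gradient(R, p, tol=1e-10, max_iter=None):
    """Solve R w = p for symmetric positive-definite R using only
    matrix-vector products; no inverse of R is ever formed."""
    n = len(p)
    max_iter = max_iter or n
    w = np.zeros(n)
    r = p - R @ w              # residual
    d = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        Rd = R @ d
        alpha = rs / (d @ Rd)  # exact line search along d
        w += alpha * d
        r -= alpha * Rd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d  # next R-conjugate direction
        rs = rs_new
    return w
```

Each iteration costs one matrix-vector product, which is what gives CG its favorable balance between convergence rate and complexity.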
In this paper, to reduce the computational complexity, we apply the RFM in an MCL-based KAF to address the problem of linear growth and to improve robustness. Further, the CG optimization method is used to improve the filtering accuracy and convergence rate, yielding a novel robust random Fourier features Cauchy conjugate gradient (RFFCCG) algorithm. The contributions of this paper are summarized as follows. (1) Inspired by the finite-dimensional RFM and the MCL criterion, the novel RFFCCG algorithm is derived by mapping the original input data into a fixed-dimensional RFFS, which removes the growth of the network structure and improves robustness over other robust algorithms in the presence of non-Gaussian noises. (2) By applying the CG method, RFFCCG provides good filtering accuracy against non-Gaussian noises with low computational and space complexities; both complexities are also discussed. (3) The proposed algorithm also achieves excellent tracking performance when the system undergoes a sudden change.
The rest of this paper is structured as follows. The MCL criterion and its convexity are described in Section 2, and the online CG algorithm is also briefly reviewed in this section. In Section 3, we present the proposed RFFCCG algorithm and its complexity analysis. Illustrative simulations in the presence of non-Gaussian noises are presented in Section 4 to confirm the effectiveness of the proposed algorithm. Finally, Section 5 gives the concluding remarks of this paper.
4. Simulation
To demonstrate the superior performance of the proposed RFFCCG algorithm, simulations were performed on Mackey–Glass chaotic time series prediction and on nonlinear system identification. Owing to their modest complexity and excellent performance, representative algorithms (i.e., the random Fourier features kernel least mean square (RFFKLMS) algorithm [13], the quantized kernel recursive least squares (QKRLS) algorithm [32], the random Fourier features maximum correntropy (RFFMC) algorithm [14], the kernel recursive maximum correntropy algorithm with novelty criterion (KRMC-NC) [31], and the random Fourier features conjugate gradient (RFFCG) algorithm [15]) were selected for comparison with RFFCCG. Among these algorithms, RFFMC and KRMC-NC are typical robust algorithms, while the non-robust RFFKLMS, QKRLS, and RFFCG are used as filtering performance references. For all simulations, we ran 50 independent Monte Carlo trials to reduce disturbances, using Matlab R2016b on Windows 10 on a PC with a 3.30 GHz CPU and 8 GB of RAM.
To evaluate the filtering performance of the algorithms, the testing mean-square error (MSE) is defined as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d(i) - \hat{d}(i)\right)^{2},$$
where $\hat{d}(i)$ is the prediction of the desired output $d(i)$, and $N$ is the length of the testing data.
The non-Gaussian noise model considered in this section is the impulsive noise [33], modeled as the combination of two mutually independent noise processes. We assumed the mixture noise model in the form of $v(i) = (1 - b(i))\,v_1(i) + b(i)\,v_2(i)$, where $b(i)$ is a binary distribution with occurrence probabilities $\Pr\{b(i) = 1\} = c$ and $\Pr\{b(i) = 0\} = 1 - c$. Unless mentioned otherwise, the parameter $c$ was set to 0.1, and $v_1(i)$ is a zero-mean Gaussian distribution with fixed variance $\sigma_1^2$. For $v_2(i)$, we mainly considered the $\alpha$-stable noise (heavy-tailed impulsive noise) process with characteristic function [34]
$$\psi(t) = \exp\left\{ j\delta t - \gamma |t|^{\alpha} \left[ 1 + j\beta\,\operatorname{sgn}(t)\,S(t,\alpha) \right] \right\},$$
where
$$S(t,\alpha) = \begin{cases} \tan\dfrac{\alpha\pi}{2}, & \alpha \neq 1, \\[4pt] \dfrac{2}{\pi}\log|t|, & \alpha = 1, \end{cases}$$
$\alpha \in (0, 2]$ is the characteristic factor measuring the heaviness of the tail (a smaller $\alpha$ means a stronger impulse), $\gamma > 0$ is the dispersion factor that controls the number of impulses, $\delta$ is the location factor, $\beta \in [-1, 1]$ is the symmetry factor, and $\operatorname{sgn}(\cdot)$ is the sign function. The parameter vector of the noise model is written as $V(\alpha, \beta, \gamma, \delta)$, and a fixed setting of this vector was used in the following simulations.
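A minimal sketch of this mixture noise is given below, assuming a symmetric $\alpha$-stable component ($\beta = 0$, $\delta = 0$) generated with the Chambers–Mallows–Stuck method; all numeric parameter values are illustrative placeholders.

```python
import numpy as np

def symmetric_alpha_stable(alpha, gamma, size, rng):
    """Draw symmetric alpha-stable samples (beta = 0, delta = 0) via the
    Chambers-Mallows-Stuck method; gamma is the dispersion factor."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform phase
    W = rng.exponential(1.0, size)                 # unit exponential
    X = (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
         * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))
    return gamma ** (1 / alpha) * X

def mixture_impulsive_noise(n, c=0.1, sigma1=0.1, alpha=1.5, gamma=0.1, seed=0):
    """v(i) = (1 - b(i)) v1(i) + b(i) v2(i): a Gaussian background v1
    hit by alpha-stable impulses v2 with occurrence probability c."""
    rng = np.random.default_rng(seed)
    b = rng.binomial(1, c, n)                      # Bernoulli switch
    v1 = rng.normal(0.0, sigma1, n)                # zero-mean Gaussian
    v2 = symmetric_alpha_stable(alpha, gamma, n, rng)
    return (1 - b) * v1 + b * v2
```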
4.1. Mackey–Glass Time Series
Since the Mackey–Glass (MG) chaotic system is a benchmark for nonlinear learning problems, we first considered the MG chaotic time series [2], generated by the delayed differential equation
$$\frac{dx(t)}{dt} = -a\,x(t) + \frac{b\,x(t-\tau)}{1 + x^{n}(t-\tau)}$$
with the standard benchmark values $a = 0.1$, $b = 0.2$, $\tau = 30$, and $n = 10$. The time series was discretized at a sampling period of 6 s and corrupted by the noise model described above. We used the previous seven points $[x(i-7), x(i-6), \ldots, x(i-1)]^{T}$ to predict the current value $x(i)$. The prediction was trained on 2000 data points and tested on another 200.
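A sketch of how such a series can be generated and framed for prediction is given below, using forward Euler integration with the benchmark values above; the integration step size, function names, and initial condition are illustrative.

```python
import numpy as np

def mackey_glass(n_samples, a=0.1, b=0.2, tau=30.0, n_pow=10,
                 dt=0.1, sample_every=60, x0=1.2):
    """Integrate dx/dt = -a x(t) + b x(t - tau) / (1 + x(t - tau)^n_pow)
    with forward Euler; keep every `sample_every`-th point so the
    sampling period is dt * sample_every = 6 s."""
    delay = int(tau / dt)
    steps = n_samples * sample_every
    x = np.empty(delay + steps + 1)
    x[:delay + 1] = x0                      # constant history on [-tau, 0]
    for i in range(delay, delay + steps):
        x_tau = x[i - delay]
        x[i + 1] = x[i] + dt * (-a * x[i] + b * x_tau / (1 + x_tau ** n_pow))
    return x[delay + 1::sample_every][:n_samples]

def embed(series, order=7):
    """Stack the previous `order` points as inputs to predict the next one."""
    U = np.array([series[i:i + order] for i in range(len(series) - order)])
    d = series[order:]
    return U, d
```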
The Cauchy loss scale parameter $\lambda$ is key in the proposed RFFCCG algorithm. In the first simulation, we discuss the influence of $\lambda$ on the filtering accuracy of RFFCCG in combating non-Gaussian noises, with $\lambda$ selected over a wide range. The influence of $\lambda$ is shown in Figure 2, where the steady-state MSEs were derived by averaging over the last 200 iterations. For RFFCCG, the Gaussian kernel bandwidth, the forgetting factor, and the dimension of the RFFS were fixed. It can be seen from Figure 2 that $\lambda$ has a direct influence on the filtering performance of RFFCCG: the algorithm achieved the highest filtering accuracy at an intermediate value of $\lambda$, so too large or too small a $\lambda$ causes performance degradation, while an appropriate $\lambda$ combats impulsive noises efficiently. We therefore used this best-performing $\lambda$ for RFFCCG in the following simulations.
In addition, the steady-state MSEs and the average consumed time for different dimensions $m$ are plotted in Figure 3. Here, the simulation environment and kernel bandwidth setting of RFFCCG were the same as those of Figure 2, and the range of $m$ was set as $[1, 100]$. From Figure 3, we observe that: (1) the average consumed time increased linearly with $m$; and (2) the filtering accuracy of RFFCCG improved with increasing $m$ up to a point, beyond which it remained almost unchanged. In other words, a larger dimension $m$ yields higher filtering accuracy at the expense of increased computational time. The dimension $m$ of the RFFS was therefore fixed for RFFCCG to provide a trade-off between filtering accuracy and computational time.
In this example, we compared the filtering accuracy and robustness of RFFCCG with those of the other filtering algorithms. The parameters of each algorithm were set to achieve the desired filtering accuracy and the same convergence rate: the bandwidth of the Gaussian kernels was set to 1 for all algorithms; a common step size was used for RFFKLMS and RFFMC; a quantization threshold was chosen for QKRLS; the distance threshold, the error threshold, and the regularization parameter were tuned for KRMC-NC; the same forgetting factor was set for RFFCG and RFFCCG; the best-performing $\lambda$ was chosen for RFFCCG; and the dimension of the RFFS was fixed as above. From Figure 4, we observe that the performance of the quadratic-loss algorithms (i.e., RFFKLMS, QKRLS, and RFFCG) deteriorated in the non-Gaussian noise environment, while RFFMC, KRMC-NC, and RFFCCG always generated stable performance and achieved desirable performance when impulsive noise appeared. In particular, the filtering performance of RFFCCG was very close to that of the recursive KRMC-NC algorithm and better than that of the SGD-based RFFMC algorithm.
Table 3 lists the detailed simulation results in terms of dictionary size, steady-state MSE, and average consumed time. One can also observe that RFFCCG produced filtering accuracy comparable to KRMC-NC with less consumed time and lower storage requirements. Thus, RFFCCG is the most efficient of the compared algorithms for MG time series prediction.
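To make the simulation pipeline concrete, the sketch below chains the earlier pieces (RFM features, MG data, mixture noise) around a Cauchy-loss SGD update. This is an RFFKLMS-style baseline under the MCL criterion, not the proposed CG recursion, and every numeric value is illustrative.

```python
import numpy as np

# Reuses make_rfm, mackey_glass, embed, and mixture_impulsive_noise from
# the earlier sketches (all illustrative).
series = mackey_glass(2200)
U, d = embed(series, order=7)                  # 7-tap inputs, as in the text
d_noisy = d[:2000] + mixture_impulsive_noise(2000, c=0.1)

z = make_rfm(input_dim=7, m=100, sigma=1.0)    # fixed-dimensional RFFS
w = np.zeros(100)
eta, lam = 0.5, 1.0                            # step size and Cauchy scale

for u_i, d_i in zip(U[:2000], d_noisy):
    phi = z(u_i)
    e = d_i - w @ phi
    # Bounded-influence gradient of the Cauchy loss: lam^2 e / (lam^2 + e^2)
    w += eta * (lam ** 2 * e / (lam ** 2 + e ** 2)) * phi

test_pred = np.array([w @ z(u) for u in U[2000:2200]])
testing_mse = np.mean((d[2000:2200] - test_pred) ** 2)
```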
4.2. Nonlinear System Identification
To further validate the superiority of RFFCCG, we considered a nonlinear system identification problem, where the nonlinear system takes the form given in [35]: the output $y(k)$ at discrete time $k$ depends nonlinearly on the two previous outputs through a coefficient vector, with fixed initial values. The prediction task is set up as follows: the previous two values $[y(k-1), y(k-2)]^{T}$ were used as the input vector to predict the current value $y(k)$. We considered stationary and non-stationary scenarios in the following simulations. The data were corrupted by the noise model described above, and the same Gaussian kernel parameter was used for all tested algorithms.
In the stationary case, the coefficient vector was fixed. The first 2000 data points were used for training and an additional 200 for testing. We compared the testing MSE of RFFCCG with those of RFFKLMS, QKRLS, RFFMC, KRMC-NC, and RFFCG owing to their modest complexities and excellent performance for a stationary system. The parameters of each algorithm (step sizes, thresholds, regularization parameter, forgetting factor, and $\lambda$) were chosen to obtain the best results, and the dimension of the RFFS was again configured to balance accuracy and computational time. The learning curves of all the algorithms are shown in Figure 5. In this case, the RFFCCG algorithm retained satisfactory prediction ability, achieving performance comparable to KRMC-NC and better than RFFMC, while the other algorithms performed poorly. This again shows that RFFCCG is strongly robust against impulsive noises.
Table 4 shows the dictionary size, steady-state MSEs, and average consumed time of all the algorithms. As can be clearly seen from Figure 5 and Table 4, RFFCCG consumed less time and achieved a faster convergence rate and higher filtering accuracy than the compared algorithms, including RFFCG.
The tracking performance was evaluated in a non-stationary system where two different coefficient vectors were used for data generation: one coefficient vector was used for the first 2000 data points, and another for the following 2000. We compared the testing MSE of RFFCCG with those of RFFKLMS, RFFMC, and RFFCG, owing to their modest complexities and excellent performance in a non-stationary system. To compute the convergence curve, a total of 4000 data points were used for training, with a sudden change at the 2001-st data point; for testing, 400 data points were used, with a sudden change at the 201-st data point. With the same criterion for parameter setting, the step sizes of RFFKLMS and RFFMC were chosen as 0.1 and 0.3, respectively, the same forgetting factor was used in RFFCG and RFFCCG, and the dimension of the RFFS was fixed as before. The performance comparison is presented in Figure 6. It can be observed that all the RFF-based algorithms were capable of tracking the change of the system; however, RFFCCG outperformed all the compared algorithms when the abrupt change occurred. The dictionary size, steady-state MSEs, and consumed time of the tested algorithms, averaged over the final iterations of each stage, are summarized in Table 5. As observed in Figure 6 and Table 5, RFFCCG provides good tracking performance for a non-stationary system in non-Gaussian noises.
Therefore, in both stationary and non-stationary circumstances of nonlinear system identification, the proposed RFFCCG algorithm offers excellent performance in terms of filtering accuracy, convergence rate, robustness, and computational and space complexities.