Correntropy-Based Constructive One Hidden Layer Neural Network

Mojtaba Nayyeri; Modjtaba Rouhani; Hadi Sadoghi Yazdi; Marko M. Mäkelä; Alaleh Maskooki; Yury Nikulin

doi:10.3390/a17010049

,

and

¹

Institute for Artificial Intelligence, University of Stuttgart, 70569 Stuttgart, Germany

²

Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad 1696700, Iran

³

Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland

^*

Author to whom correspondence should be addressed.

Algorithms2024, 17(1), 49;https://doi.org/10.3390/a17010049

This article belongs to the Section Algorithms for Multidisciplinary Applications

Version Notes

Order Reprints

Abstract

One of the main disadvantages of the traditional mean square error (MSE)-based constructive networks is their poor performance in the presence of non-Gaussian noises. In this paper, we propose a new incremental constructive network based on the correntropy objective function (correntropy-based constructive neural network (C2N2)), which is robust to non-Gaussian noises. In the proposed learning method, input and output side optimizations are separated. It is proved theoretically that the new hidden node, which is obtained from the input side optimization problem, is not orthogonal to the residual error function. Regarding this fact, it is proved that the correntropy of the residual error converges to its optimum value. During the training process, the weighted linear least square problem is iteratively applied to update the parameters of the newly added node. Experiments on both synthetic and benchmark datasets demonstrate the robustness of the proposed method in comparison with the MSE-based constructive network, the radial basis function (RBF) network. Moreover, the proposed method outperforms other robust learning methods including the cascade correntropy network (CCOEN), Multi-Layer Perceptron based on the Minimum Error Entropy objective function (MLPMEE), Multi-Layer Perceptron based on the correntropy objective function (MLPMCC) and the Robust Least Square Support Vector Machine (RLS-SVM).

Keywords:

information theoretic learning; probability theory; measure space; correntropy; non-Gaussian noise; constructive network; compact architecture; half-quadratic programming problem

1. Introduction

Non-Gaussian noises, especially impulse noise, and outliers are one of the most challenging issues in training adaptive systems including adaptive filters and feedforward networks (FFNs). The mean square error (MSE), the second-order statistic, is used widely as the objective function for adaptive systems due to its simplicity, analytical tractability and linearity of its derivative. The Gaussian noise assumption beyond MSE objective functions supposes that many real-world random phenomena may be modeled by Gaussian distribution. Under this assumption, MSE could be capable of extracting all information from data whose statistic is defined solely by the mean and variance [1]. Most real-world random phenomena do not have a normal distribution and the MSE-based methods may perform unsatisfactorily in such cases.

Several types of feedforward networks have been proposed by researchers. From the architecture viewpoint, these networks can be divided into four classes including fixed structure networks, constructive networks [2,3,4,5,6], pruned networks [7,8,9,10] and pruning constructive networks [11,12,13].

The constructive networks start with a minimum number of nodes and connections, and the network size is increased gradually. These networks may have an adjustment mechanism based on the optimization of an objective function. The following literature survey focuses on the single-hidden layer feedforward networks (SLFNs) and multi-hidden layer feedforward networks with incremental constructive architecture, which are trained based on the MSE objective function.

Fahlman and Lebier [2] proposed a cascade correlation network (CCN) in which new nodes are added and trained one by one, creating a multi-layer structure. The parameters of the network are trained to maximize the correlation between the output of the new node and the residual error. The authors in [3] proposed several objective functions for training the new node. They proved that the networks with such objective functions are universal approximators. Huang et al. [4] proposed a novel cascade network. They used the orthogonal least square (OLS) method to drive a novel objective function for training new hidden nodes. Ma and Khorasani [6] proposed a constructive one hidden layer feedforward network in which its hidden unit activation functions are Hermite polynomial functions. This approach results in a more efficient capture of the underlying input–output map. They proposed a new one hidden layer constructive adaptive neural network (OHLCN) scheme in which the input and output sides of the training are separated [5]. They scaled error signals during the learning process to achieve better performance. Inefficient input connections are pruned to achieve better performance. A new constructive scheme was proposed by Wu et al. [14] based on a hybrid algorithm, which is presented by combining the Levenberg–Marquardt algorithm and the least square method. In their approach, a new randomly selected neuron is added to the network when training is entrapped into local minima.

Inspired by information theoretic learning (ITL), correntropy, which is a localized similarity measure between two random variables [15,16], has recently been utilized as the objective function for training adaptive systems. Bessa et al. [17] employed maximum correntropy criterion (MCC) for training neural networks with fixed architecture. They compared the Minimum Error Entropy (MEE) and MCC-based neural networks with MSE-based networks and reported new results in wind power prediction. Singh and Principe [18] used correntropy as the objective function in the linear adaptive filter to minimize the error between the output of the adaptive filter and the desired signal, to adjust the filter weights. Shi and Lin [19] employed a convex combination scheme to improve the performance of the MCC adaptive filtering algorithm. They showed that the proposed method has better performance compared to the original signal filter algorithm. Zhao et al. [20] combined the advantage of Kernel Adaptive Filter and MCC and proposed Kernel Maximum Correntropy (KMC). The simulation results showed that KMC has significant performance in the noisy frequency doubling problem [20]. Wu et al. [21] employed MCC to train Hammerstein adaptive filters and showed that it provides a robust method in comparison to the traditional Hammerstein adaptive filters. Chen et al. [22] studied a fixed-point algorithm for MCC and showed that under sufficient conditions convergence of the fixed-point MCC algorithm is guaranteed. The authors in [23] studied the steady-state performance of adaptive filtering when MCC is employed. They established a fixed-point equation in the Gaussian noise condition to obtain the exact value of the steady-state excess mean square error (EMSE). In non-Gaussian conditions, using the Taylor expansion approach, they derived an approximate analytical expression for the steady-state EMSE. Employing stack auto-encoders and the correntropy-induced loss function, Chen et al. [24] proposed a robust deep learning model. The authors in [25], inspired by correntropy, proposed a margin-based loss function for classification problems. They showed that in their method, outliers that produce high error have little effect on discriminant function. In [26], the authors provided a learning theory analysis for the connection between the regression model associated with the correntropy-induced loss and the least square regression model. Furthermore, they studied its convergence property and concluded that the scale parameter provides a balance between the convergence rate of the model and its robustness. Chen and Principe [27] showed that maximum correntropy estimation is a smooth maximum a posteriori estimation. They also proved that when kernel size is larger than the special value and some condition is held, maximum correntropy estimation has a unique optimal solution due to the strictly concave region of the smooth posterior distribution. The authors in [28], investigated the approximation ability of a cascade network when its input parameters are calculated by the correntropy objective function with a sigmoid kernel. They reported that their method works better than the other methods introduced in [28] when data are contaminated by noise.

MCC with Gaussian kernel is a non-convex objective function that leads to local solutions for neural networks. In this paper, we propose a new method to overcome this bottleneck by adding hidden nodes one by one until the constructive network reaches a specific amount of predefined accuracy or reaches a maximum number of nodes. We prove that the correntropy of the constructive network constitutes a strictly increasing sequence after adding each hidden node and converging to its maximum.

This paper can be considered as an extension of [28]. While in [28] the correntropy measure was based on the sigmoid kernel in the objective function to adjust the input parameters of a newly added node in a cascade network, in this paper, the kernel in the correntropy objective function is changed from sigmoid to Gaussian kernel. This objective function is then used for training both input and output parameters of the new nodes in a single-hidden layer network. The proposed method performs better than [28] for two reasons: (1) the Gaussian kernel provides better results than the sigmoid kernel as it is a local similarity measure, and (2) in contrast to [28], in this paper, correntropy is used to train both the input and output parameters of each newly added node.

In a nutshell, the proposed method has the following advantages:

The proposed method is robust to non-Gaussian noises, especially impulse noise, since it takes advantage of the correntropy objective function. In particular, the Gaussian kernel provides better results than the sigmoid kernel. The reason for the robustness of the proposed method is discussed in Section 4 analytically, and in Section 5 experimentally.
Most of the methods that employ correntropy as the objective function to adjust their parameters suffer from local solutions. In the proposed method, the amount of correntropy of the network is increased by adding new nodes and converging to its maximum; thus, the global solution is provided.
The network size is determined automatically; consequently, the network does not suffer from over/underfitting, which results in satisfactory performance.

The structure of the remainder of this paper is as follows. In Section 2, some necessary mathematical notations, definitions and theorems are presented. Section 3 presents some related previous work. Then a correntropy-based constructive neural network (C2N2) is proposed in Section 4. Experimental results and a comparison with other methods are carried out in Section 5. The paper is concluded in Section 6.

2. Mathematical Notations, Definitions and Preliminaries

In this section, first, measure and function spaces that are necessary for describing previous work are defined in Section 2.1. Section 2.2 introduces the structure of the single-hidden layer feedforward network (SLFN) that is used in this paper, followed by its mathematical notations and definitions of its related variables.

2.1. Measure Space, Probability Space and Function Space

As mentioned in [3], let

X

be the input space that is a bounded measurable subset in

R^{d}

and

L^{2} (X)

be the space of all function f that is

\int_{X} {(f (x))}^{2} d μ (x) < \infty

. For

u, v \in L^{2} (X)

, the inner product is defined as follows:

{⟨ u, v ⟩}_{μ} : = \int_{X} u (x) v (x) d μ (x) .

where

μ

is a positive measure on input space. Under the measure

μ

, the

l 2

norm in

L^{2} (X)

space is denoted as

{| \cdot |}_{2}

. The closeness between u and v is measured by

{| u - v |}_{2} = {(\int_{X} {(u (x) - v (x))}^{2} d μ (x))}^{\frac{1}{2}}

The angle between u and v is defined by

θ_{u, v} : = arccos (\frac{{⟨ u, v ⟩}_{μ}}{{| u |}_{2} {| v |}_{2}})

Definition 1

([28,29]). Let W be a probability space that is a measure space with a total measure one. This space is represented as follows:

W = (Ω, F, P)

where Ω is its sample space. In this paper, Ω is considered a compact subset of

R^{d}, F

is a sigma-algebra of events and

P

is a probability measure that is a measure on

F

with

P (Ω) = 1

.

Definition 2

([28,29]). Let

L^{p} (Ω, F, P), 1 \leq p < \infty

be a set of all p - integrable random variables

X : Ω \to R

, i.e.,

{∥ X ∥}_{p} = {(\int_{Ω} X^{p} d P)}^{\frac{1}{p}} = {(E (X^{p}))}^{\frac{1}{p}} < \infty

This is a vector space and the inner product in this space is defined as follows:

⟨ X, Y ⟩ : = \int_{Ω} X (ω) Y (ω) d P = E (X Y)

where

X, Y \in L^{p} (Ω, F, P)

and

E (\cdot)

is expectation in probability theory. The closeness between two random variables X and Y is measured by

L^{p} (Ω, F, P)

norm:

\begin{matrix} {∥ X - Y ∥}_{p} = {(\int_{Ω} {| X (ω) - Y (ω) |}^{p} d P)}^{\frac{1}{p}} = \\ {(E ({| X - Y |}^{p}))}^{\frac{1}{p}} X, Y \in L^{p} (Ω, F, P) \end{matrix}

In ITL, the correlation between random variables is generalized to correntropy, which is a measure of similarity [15]. Let X and Y be two given random variables; the correntropy in the sense of [15] is defined as

V_{M} (X, Y) : = E (k_{M} (X, Y))

where

k_{M} (\cdot, \cdot)

is a Mercer kernel function.

In general, a Mercer kernel function is a type of positive semi-definite kernel function that satisfies Mercer’s condition. Formally, a symmetric function

k_{M}

is a Mercer kernel function if, for any positive integer m and any set of random variables

X_{1}, \dots, X_{m} \in L^{p} (Ω, F, P)

, the corresponding Gram matrix

K_{i j} = k_{M} (X_{i}, X_{j})

is positive semi-definite.

In the definition of correntropy,

E (\cdot)

implies the expected value of the random variable and

M

is replaced by

α

or

σ

if the sigmoid or Gaussian kernel (radial basis) are used, respectively. In our recent work [28], we use a sigmoid kernel, which is defined as

k_{α} (X, Y) = tanh (α ⟨ X, Y ⟩ + c),

where

α, c \in R

are scale and offset hyperparameters of the sigmoid kernel. The offset parameter c in the sigmoid kernel influences the shape of the kernel function. A higher value of c leads to a steeper sigmoid curve, making the kernel function more sensitive to variations in the input space. It is important to note that the choice of hyperparameters, including c and

α

, can significantly impact the performance of a machine learning model using the sigmoid kernel. These parameters are often tuned during the training process to optimize the model for a specific task or dataset.

In contrast to [28], in this paper, we use the Gaussian kernel that is represented as

k_{σ} (X, Y) = \frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥ X - Y ∥}^{2}}{2 σ^{2}})

where

σ

is the variance for the Gaussian function. Here,

σ

controls the width of the Gaussian kernel. A larger

σ

results in a smoother and more slowly decaying kernel, while a smaller

σ

leads to a narrower and more rapidly decaying kernel.

Let error function be defined as

e : = e (X, Y) \equiv X - Y

; the correntropy of the error function is represented as

V_{σ} (e) = E (k_{σ} (X, Y)) .

At the end of this subsection, we note that, alternatively, the Wasserstein distance, also known as the Earth Mover’s Distance (EMD), Kantorovich–Rubinstein metric, Mallows’s distance or optimal transport distance, can be used as a metric that quantifies the minimum cost of transforming one probability distribution into another and can therefore be used to quantify the rate of convergence when the error is measured in some Wasserstein distance [30]. The relationship between correntropy and Wasserstein distance is often explored in the context of kernelized Wasserstein distances. By using a kernel function, the Wasserstein distance can be defined in a reproducing kernel Hilbert space (RKHS). In this framework, correntropy can be seen as a special case of a kernelized Wasserstein distance when the chosen kernel is the Gaussian kernel.

2.2. Network Structure

This paper focuses on the single-hidden layer feedforward network. As shown in Figure 1, it has three layers, including the input layer, the hidden layer and the output layer. Without loss of generality, this paper considers the SLFN with only one output node.

Figure 1. SLFN with additive nodes.

The output of SLFN with L hidden nodes is represented as follows [3]:

f_{L} = \sum_{i = 1}^{L} β_{i} g_{i} (x)

where

g_{i}

is the i-th hidden node and can be one of the two following types:

For additive nodes

g_{i} (x) = g ({⟨w_{i}, x⟩}_{μ} + b_{i}); w_{i} \in R^{d} and b_{i} \in R

2.: For RBF nodes

g_{i} (x) = g (\frac{{|x - w_{i}|}_{2}}{b_{i}}); w_{i} \in R^{d} and b_{i} \in R^{+}

For additive nodes, the vector

w_{i}

is the input weights of the i-th hidden node and

b_{i}

is its bias. For RBF nodes, the vector

w_{i}

is the center of the i-th radial basis function and

b_{i}

is its impact factor.

All networks that can be generated are represented as the following functions set [3]:

O = ⋃_{L = 1}^{\infty} O_{L}

where

O_{L} = \{f_{L} ∣ f_{L} (x) = \sum_{i = 1}^{L} β_{i} g_{i} (x); β_{i} \in R, g_{i} (x) \in G\}

and

G

is a set of all possible hidden nodes. For additive nodes, we have

G = \{g ({⟨ w, x ⟩}_{μ} + b); w \in R^{d}, b \in R\}

For the RBF case, we have

G = \{g (\frac{{| x - w |}_{2}}{b}); w \in R^{d}, b \in R^{+}\}

Let f be a target function that is approximated by the network with L hidden nodes. The network residual error function is defined as follows:

e_{L} : = f - f_{L},

In practice, the function form of error is not available and the network is trained on finite data samples, which are described as

X = {\{x_{i}, y_{i}\}}_{i = 1}^{N}

, where

x_{i} \in R^{d}

is the d dimension input vector of the i-th training sample and

y_{i} \in R

is its target value. Thus, the error vector on the training samples is denoted as follows:

E_{L} = (E_{L 1}, \dots, E_{L N}),

where

E_{L i}

is the error of the i-th training sample for the network with L hidden nodes

(E_{L i} = e_{L} (x_{i}))

. Furthermore, the activation vector for the L-th hidden node is

G_{L} = (G_{L 1}, \dots, G_{L N})

where

G_{L i}

is the output of the L-th hidden nodes for the i-th training sample

(G_{L i} = g_{L} (x_{i}))

.

3. Previous Work

There are several types of constructive neural networks. In this section, the networks that are proposed in [3,28] are introduced. In those methods, the network is constructed by adding a new node to the network in each step. The training process of the newly added node (L-th hidden node) is divided into two phases: the first phase is devoted to adjusting the input parameters and the second phase is devoted to adjusting the output weight. When the parameters of the new node are obtained, they are fixed and do not change during the training of the next nodes.

3.1. The Networks Introduced in [3]

For the input parameters’

(w_{L}, b_{L})

adjustment in [3], several objective functions are proposed to adjust the input parameters of the newly added node in the constructive network. They are as follows [3]:

\begin{matrix} V_{1} = {(\frac{E_{L - 1} G_{L}^{T}}{G_{L} G_{L}^{T}})}^{2}, \\ V_{2} = {(E_{L - 1} G_{L}^{T})}^{2}, \\ V_{3} = {(\frac{(E_{L - 1} - {\bar{E}}_{L - 1}) {(G_{L} - {\bar{G}}_{L})}^{T}}{∥G_{L} - {\bar{G}}_{L}∥})}^{2}, \\ V_{4} = \sqrt{V_{1}}, \\ V_{5} = \sqrt{V_{2}}, \\ V_{6} = \sqrt{V_{3}}, \\ V_{CasCor} = {((E_{L - 1} - {\bar{E}}_{L - 1}) {(G_{L} - \bar{G_{L}})}^{T})}^{2}, \end{matrix}

where

V_{CasCor}

is the objective function for the cascade correlation network,

{\bar{E}}_{L - 1} = \frac{1}{N} \sum_{i = 1}^{N} E_{L - 1} (x_{i})

. The objective function that is used to adjust the output weight of the L-th hidden node is [3]

Δ_{L} = E_{L - 1} E_{L - 1}^{T} - E_{L} E_{L}^{T}

and

Δ_{L}

is maximized if and only if [3]

β_{L} = \frac{E_{L - 1} G_{L}^{T}}{G_{L} G_{L}^{T}}

which is the optimum output parameter of the new node. In [3], the authors also proved that for each of the objective functions

V_{1}

to

V_{6}

, the network error converges.

Theorem 1

([3]). Given span (G) is dense in

L^{2}

and

\forall g \in {G, 0 < | g |}_{2} < b

for some

b \in R

. If

g_{L}

is selected so as to

maximize {(\frac{{(e_{L - 1}, g_{L}⟩}_{μ}}{{|g_{L}|}_{2}})}^{2}

, then

{lim}_{L \to \infty} {|f - f_{L}|}_{2} = 0

.

More detailed discussion about theorems and their proofs can be found in [3].

3.2. Cascade Correntropy Network (CCOEN) [28]

The authors in [28] proved that if the input parameters of each new node in a cascade network are assessed by using the correntropy objective function with the sigmoid kernel and its output parameter is adjusted by

β_{L} = \frac{E (e_{L - 1} g_{L})}{E (g_{L}^{2})}

then the network is a universal approximator. The following theorem investigates the approximation ability of CCOEN:

Theorem 2

([28]). Suppose

span (g)

is dense in

P^{2} (Ω, F, G)

. For any continuous function f and for the sequence of error similarity feedback functions

\{g_{L}^{s (e)}\}, L \in N

, there exists a real sequence

\{η_{L}; L \in N\}

such that

lim_{L \to \infty} E (e_{L}^{2}) = 0

holds with probability one if

\begin{matrix} g_{L}^{s (e)} = {argmax}_{g_{L} \in g} V_{σ} (e_{L - 1}, η_{L} g_{L}) \\ β_{L} = \frac{E (e_{L - 1} g_{L}^{s (e)})}{E ({(g_{L}^{s (e)})}^{2})} \end{matrix}

It was shown that CCOEN is more robust than the networks proposed in [3] when data are contaminated by noise.

4. Proposed Method

In this section, a novel constructive neural network is proposed based on the maximum correntropy criterion with the Gaussian kernel. To the best of our knowledge, it is the first time that correntropy with Gaussian kernel is employed as the objective function for training both the input and output weights of a single-hidden layer constructive network. It must be considered that the correntropy is a non-convex objective function and it is difficult to adjust the optimum solution. This section proposes a new theorem and surprisingly proves that the proposed method that is trained by using the correntropy objective function converges to the global solution. It is shown that the performance of the proposed method is excellent in the presence of non-Gaussian noise, especially impulse noise. In the proposed network, hidden nodes are added and trained one by one and the parameters of the newly added ( L-th hidden node) nodes are obtained and then fixed (see Figure 2).

Figure 2. Constructive network in which the last added node is referred to by L.

This section is organized as follows: First, some preliminaries, mathematical definitions and theorems that are necessary for presenting the proposed method and proving the convergence of the method are introduced in Section 4.1. The new training strategy for the proposed method is described in Section 4.2. In Section 4.3, the convergence of the proposed method is proven when the error and activation function are continuous random variables. In practice and during the training on the dataset, the error function is not available; thus, the error vector and activation vector are used to train the new node. Regarding this fact, in Section 4.1, two optimization problems are presented to adjust the parameters of the new node based on training data samples.

4.1. Preliminaries for Presenting the Proposed Method

This section presents a new theorem for the proposed method based on special spaces, which are defined in Definitions 1 and 2.

The following lemmas, propositions and theorems are also used in the proof of the main theorem.

Lemma 1

([31]). Given

g : R \to R, Span \{g (⟨ w, x ⟩ + b), (w, b) \in R^{d} \times R\}

is dense in

L^{p}

for every

p \in [1, \infty)

, if and only if g is not a polynomial (almost everywhere).

Proposition 1

([32]). For

G (z) = exp (- \frac{{∥ z ∥}^{2}}{2 σ^{2}})

, there exists a convex conjugated function ϕ, such that

G (z) = sup_{α \in R^{-}} (α \frac{{∥ z ∥}^{2}}{2 σ^{2}} - ϕ (α))

Moreover, for a fixed z, the supremum is reached at

α = - G (z)

.

Theorem 3

([33]). If

X_{n}

is any sequence of random variables which are positive (take values in

[0, \infty)

, increasingly converge

(X_{n} (ω) ↑ X (ω))

for any

ω \in Ω)

, and the expectation exists

(X_{n} \in L^{1}

for all n ), then

E (X_{n}) \to E (X)

.

Theorem 4

([33]). (Monotonicity) Let X and Y be random variables with

X \leq Y

, then

E (X) \leq E (Y)

, with equality if and only if

X = Y

almost surely.

Theorem 5

([34]). (Convergence) Every upper bounded increasing sequence converges to its supremum.

4.2. C2N2: Objective Function for Training the New Node

In this subsection, we combine the idea of constructive SLFN with the idea of correntropy and propose a new strong constructive network that is robust to impulsive noise. The proposed method employs correntropy as the objective function to adjust the input

(w_{L}, b_{L}, L = 1, \dots, \infty)

and output

(β_{L}, L = 1, \dots, \infty)

parameters of the network. To the best of our knowledge, it is the first time that correntropy with the Gaussian kernel has been employed for training all the parameters of a constructive SLFN. C2N2 starts with zero hidden nodes. The first hidden node is added to the network. First, the input parameters

(w_{1}, b_{1})

of the hidden node are calculated by employing a correntropy objective function with a Gaussian kernel. Then, they are fixed and the output parameter

(β_{1})

of the node is adjusted by the correntropy objective function with Gaussian kernel. After the parameters of the first node are obtained, they are then fixed and the next hidden node is added to the network and trained. This process is iterated until the stopping condition is satisfied.

The proposed method can be viewed as an extension of CCOEN [28] with the following differences:

In contrast to CCOEN, which uses correntropy with a sigmoid kernel to adjust the input parameters of a cascade network, the proposed method uses correntropy with a Gaussian kernel to adjust the whole parameters of an SLFN.
CCOEN uses correntropy to adjust the input parameters of the new node in a cascade network to provide a more robust method. However, the output parameter of the new node in a cascade network is still adjusted based on the least mean square error. In contrast, the proposed method uses correntropy with Gaussian kernel to obtain both the input and output parameters of the new node in a constructive SLFN. Therefore, the proposed method is more robust than CCOEN and other networks introduced in [3] when the dataset is contaminated by impulsive noise.
Employing Gaussian kernel for correntropy as the objective function to adjust the network’s parameters provides a closed-form formula introduced in the next section. In other words, both the input and output parameters are adjusted by two closed-form formulas.

For the proposed network, each newly added node (L-th added node where

L = 1, \dots, \infty)

is trained using the two phases.

In the first phase, the new node is selected from

G

, using the following optimization problem: where

g_{L}^{sim (e)} = {argmax}_{g_{L} \in G} \{V (g_{L})\},

V (g_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}∥}^{2}}{2 σ^{2}})) .

From the definition of the kernel, the most similar

(g_{L}^{sim (e)})

activation function to the residual error of L− 1 nodes network is selected from

G

as this node is selected to maximize:

\begin{matrix} V (g_{L}) = & E (⟨Φ (e_{L - 1}), Φ (k_{L} g_{L})⟩) \\ , k_{L} \in R - {0} \end{matrix}

where

Φ

is feature mapping. Consequently, the biggest reduction in error is obtained and the network has a more compact architecture.

In the second phase, the output parameter

(β_{L})

of the new node is adjusted whereby

β_{L}^{sim (e)} = {argmax}_{β_{L}} \{V (β_{L}), β_{L} \in R\},

V (β_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}})) .

These two phases are iterated and a new node is added in each iteration until the certain stopping condition is satisfied. This is discussed in Section 5.

After the parameters of the new node are tuned, the correntropy of the residual error (error of the network with L hidden nodes) is shown as

V (e_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - β_{L}^{sim (e)} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}})),

and the residual error is updated as follows:

e_{L} = e_{L - 1} - β_{L}^{sim (e)} g_{L}^{sim (e)} .

It is important to note that this subsection only presents two optimization problems for adjusting the input and output parameters of the new node. In Section 4.4, we present a way to solve these problems.

4.3. Convergence Analysis

In this subsection, we prove that the correntropy of the newly constructed network undergoes a strictly increasing sequence and converges to its supremum. Furthermore, it is proven that the supremum equals the maximum. To prove the convergence of the correntropy of the network, the definitions, theorems and lemma that are presented in Section 4.1 are employed. To prove convergence of the proposed method, similarly to [3,28], we propose the following lemma and prove that the new node, which is obtained from the input side optimization problem, is not orthogonal to the residual error function.

Lemma 2.

Given

span (G)

is dense in

L^{2} (Ω, F, P)

and

e_{L - 1} \in L^{2} (Ω, F, P)

. There exists a real number

k_{L} \in R - {0}

such that

g_{L}^{sim (e)}

is not orthogonal to

e_{L - 1}

, where

g_{L}^{sim (e)} = {argmax}_{g_{L} \in G} (E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}∥}^{2}}{2 σ^{2}}))) .

Employing Lemma 2 and what is mentioned in Section 4.1, the following theorem proves that the proposed method achieves its global solution.

Theorem 6.

Given an SLFN with

t a n h

(tangent hyperbolic) function for the additive nodes, for any continuous function f and for the sequence of hidden nodes functions, obtained based on the residual error functions, i.e.,

\{g_{L}^{sim (e)}\}, L \in N

, there exists a real sequence

\{k_{L}; L \in N\}

such that

lim_{L \to \infty} V (e_{L}) = V_{max}

holds almost everywhere, provided that

\begin{matrix} g_{L}^{sim (e)} = {argmax}_{g_{L} \in G} \{V (g_{L})\} . \\ β_{L}^{sim (e)} = {argmax}_{β_{L} \in R} \{V (β_{L})\}, \end{matrix}

where

V_{max} = \frac{1}{\sqrt{2 π} σ} .

The proof of Lemma 2 and Theorem 6 contain some pure mathematics contents and are placed in Appendix A.

4.4. Learning from Data Samples

In Theorem 6, we proved that the proposed network, i.e., the one hidden layer constructive neural network based on correntropy (C2N2), achieves an optimal solution. During the training process, the function form of the error is not available and the error and activation vectors are generated from the training samples. In the rest of this subsection, we propose a method to train the network from data samples.

4.4.1. Input Side Optimization

The optimization problem to adjust the input parameters is as follows:

V_{L}^{g} = max_{G_{L}} (E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥E_{L - 1} - k_{L} G_{L}∥}^{2}}{2 σ^{2}}))) .

On training data, expectation can be approximated as:

{\hat{V}}_{L}^{g} = max_{G_{L}} (\frac{1}{N \sqrt{2 π} σ} \sum_{i = 1}^{N} (exp (- \frac{{∥E_{(L - 1) i} - k_{L} G_{L i}∥}^{2}}{2 σ^{2}}))) .

The constant term

\frac{1}{N \sqrt{2 π} σ}

can be removed and the following problem can be solved instead

{\hat{V}}_{L}^{g}

:

{\hat{U}}_{L}^{g} = max_{G_{L}} (\sum_{i = 1}^{N} (exp (- \frac{{∥E_{(L - 1) i} - k_{L} G_{L i}∥}^{2}}{2 σ^{2}}))) .

Consider the following equality:

E_{(L - 1) i} = k_{L} G_{L i} i = 1, \dots, N .

In this paper, the tanh function is selected as the activation function, which is bipolar and invertible. Therefore:

g^{- 1} (\frac{E_{(L - 1) i}}{k_{L}}) = X_{i} W_{(d + 1) * 1} i = 1, \dots, N,

where

X_{i} = [\begin{matrix} x_{i} & 1 \end{matrix}], W_{(d + 1) * 1} = [\begin{matrix} W_{L} \\ b_{L} \end{matrix}]

,

X = {[X_{1}^{T}, \dots, X_{N}^{T}]}^{T}

in which

a b s (.)

is the absolute function. The range of g (domain of

g^{- 1})

is

[- 1, 1]

. Thus, it is necessary to rescale the error signal to be in the range. To do so,

k_{L}

is assigned as follows:

k_{L} = \frac{max (a b s (E_{L}))}{λ^{'}},

where

λ^{'} \in (- 1, 1) - {0}

. Let

H_{L i} = g^{- 1} (\frac{E_{(L - 1) i}}{k_{L}})

, and therefore, the term

{∥E_{(L - 1) i} - k_{L} G_{L i}∥}^{2}

can be replaced by

{∥H_{L i} - X_{i} W_{L}∥}^{2},

and thus the following problem is presented to adjust the input parameters

{\hat{U}}_{L}^{g} = max_{W_{L}} (\sum_{i = 1}^{N} (exp (- \frac{{∥H_{L i} - X_{i} W_{L}∥}^{2}}{2 σ^{2}}))) .

To achieve better generalization performance, the norm of the weights needs to be kept minimized too; thus, the problem above is reformulated as

{\hat{U}}_{L}^{g} = max_{W_{L}} (\sum_{i = 1}^{N} (exp (- \frac{{∥H_{L i} - X_{i} W_{L}∥}^{2}}{2 σ^{2}})) - \frac{C}{2} {∥W_{L}∥}^{2}) .

It should be considered that if

C = 0

, both problems from above are equivalent. Since the necessary condition for convergence is that in each step and by adding each node amount of correntropy of the error should be increased, in the experiment section,

C \approx 0

and other amounts for C are checked and the best result is selected. This guarantees convergence of the method according to Theorem 6.

The half-quadratic method is employed to adjust the input parameters. Based on Proposition 1, we have

⊓_{L}^{g} (α, W_{L}) = max_{W_{L}, α} (\sum_{i = 1}^{N} (α_{i} \frac{{∥H_{L i} - X_{i} W_{L}∥}^{2}}{2 σ^{2}} - Φ (α_{i})) - \frac{C}{2} {∥W_{L}∥}^{2}) .

The local solution of the above optimization problem is adjusted using the following iterative process:

\{\begin{matrix} α_{i}^{t + 1} = - G (H_{L i} - X_{i} W_{L}) \\ W_{L}^{t + 1} = {arg}_{W_{L}} max (\sum_{i = 1}^{N} (α_{i}^{t + 1} \frac{{∥H_{L i} - X_{i} W_{L}∥}^{2}}{2 σ^{2}}) - \frac{C}{2} {∥W_{L}∥}^{2}) \end{matrix},

i.e., the following optimization problem needs to be solved in each iterate:

V_{L}^{g} (α, W_{L}) = max_{W_{L}} (\sum_{i = 1}^{N} (α_{i}^{t + 1} \frac{{∥H_{L i} - X_{i} W_{L}∥}^{2}}{2 σ^{2}}) - \frac{C}{2} {∥W_{L}∥}^{2}) .

Since

σ^{2}

is a constant term, it can be removed from the optimization problem. Then, the optimization problem can be multiplied by

\frac{1}{C}

. We set

C^{'} = \frac{1}{C}

. Thus, the following constraint optimization problem is obtained:

\{\begin{matrix} max \sum_{i = 1}^{N} \frac{C^{'}}{2} (α_{i}^{t + 1} ξ_{i}^{2}) - \frac{1}{2} {∥W_{L}∥}^{2} \\ s . t . X_{i} W_{L} = H_{L i} - ξ_{i} i = 1, \dots, N \end{matrix}

The Lagrangian is constituted as

L (ξ_{i}, η_{i}, W_{L}) = \sum_{i = 1}^{N} \frac{C^{'}}{2} (α_{i}^{t + 1} ξ_{i}^{2}) - {∥W_{L}∥}^{2} - \sum_{i = 1}^{N} η_{i} (X_{i} W_{L} - H_{L i} + ξ_{i}) .

The derivations of the Lagrangian function with respect to its variables are the following

\frac{\partial L}{\partial W_{L}} = 0 \to W_{L} = - \sum_{i = 1}^{N} η_{i} X_{i}^{T} = - X^{T} η,

where

η = {[η_{1}, \dots, η_{N}]}^{T}

.

\begin{matrix} \frac{\partial L}{\partial ξ_{i}} = 0 \to η_{i} = C^{'} α_{i}^{t + 1} ξ_{i} i = 1, \dots, N, \\ \frac{\partial L}{\partial η_{i}} = 0 \to X_{i} W_{L} - H_{L i} + ξ_{i} = 0 i = 1, \dots, N . \end{matrix}

Now we consider two cases.

Case 1.

d \leq N

By substituting derivatives in

W_{L} = - \sum_{i = 1}^{N} C^{'} α_{i}^{t + 1} ξ_{i} X_{i}^{T}

we obtain

\begin{matrix} W_{L} = - \sum_{i = 1}^{N} C^{'} α_{i}^{t + 1} (- X_{i} W_{L} + H_{L i}) X_{i}^{T} \\ = \sum_{i = 1}^{N} C^{'} α_{i}^{t + 1} (X_{i} W_{L}) X_{i}^{T} - \sum_{i = 1}^{N} C^{'} α_{i}^{t + 1} H_{L i} X_{i}^{T} . \end{matrix}

Let

Ψ

be a diagonal matrix with

Ψ_{i i} = α_{i}

; therefore,

\begin{matrix} W_{L} - (C^{'} X^{T} Ψ X W_{L}) = - C^{'} X^{T} Ψ H_{L} \\ W_{L} (I - C^{'} X^{T} Ψ X) = - C^{'} X^{T} Ψ H_{L} \\ W_{L} = {(X^{T} Ψ X - \frac{I}{C^{'}})}^{- 1} X^{T} Ψ H_{L} . \end{matrix}

Case 2.

d \geq N

By substituting derivatives in

X W_{L} - H_{L} + ξ = 0 \to - X X^{T} η - H_{L} + ξ = 0,

η = C^{'} Ψ ξ .

we obtain

\begin{matrix} - X X^{T} C^{'} Ψ ξ - H_{L} + ξ = 0 \\ \to - X X^{T} C^{'} Ψ ξ + ξ = H_{L} \\ \to (- C^{'} X X^{T} Ψ + I) ξ = H_{L} \\ \to (- X X^{T} Ψ + \frac{I}{C^{'}}) C^{'} ξ = H_{L} \\ \to C^{'} ξ = {(- X X^{T} Ψ + \frac{I}{C^{'}})}^{- 1} H_{L} \end{matrix}

Then

\begin{matrix} - {(X^{T})}^{+} W_{L} = Ψ {(- X X^{T} Ψ + \frac{I}{C^{'}})}^{- 1} H_{L} \\ \to W_{L} = X^{T} Ψ {(X X^{T} Ψ - \frac{I}{C^{'}})}^{- 1} H_{L} \end{matrix}

Thus, the input parameters are obtained by the following iterative process:

\{\begin{matrix} α_{i}^{t + 1} = - G (H_{L i} - X_{i} W_{L}^{t}) \\ W_{L}^{t + 1} = {(X^{T} Ψ^{t} X - \frac{I}{C^{'}})}^{- 1} X^{T} Ψ^{t} H_{L} \\ or \\ W_{L}^{t + 1} = X^{T} Ψ^{t} {(X X^{T} Ψ^{t} - \frac{I}{C^{'}})}^{- 1} H_{L} \end{matrix}

4.4.2. Output Side Optimization

When the input parameters of the new node are obtained from the previous step, the new node is named

G_{L}^{sim (e)} = \{G_{L 1}^{sim (e)}, \dots, G_{L N}^{sim (e)}\}

, where

G_{L i}^{sim (e)} = g_{L}^{sim (e)} (x_{i})

and the output parameter is adjusted using the following optimization problem:

V_{L}^{β} = max_{β_{L}} (E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}}))) .

The expectation can be approximated on training samples:

{\hat{V}}_{L}^{β} = max_{β_{L}} (\frac{1}{N \sqrt{2 π} σ} \sum_{i = 1}^{N} (exp (- \frac{{∥E_{(L - 1) i} - β_{L} G_{L i}^{sim (e)}∥}^{2}}{2 σ^{2}}))) .

The constant term

\frac{1}{N \sqrt{2 π} σ}

can be removed and the following problem can be solved instead

{\hat{V}}_{L}^{β}

:

{\hat{U}}_{L}^{β} = max_{β_{L}} (\sum_{i = 1}^{N} (exp (- \frac{{∥E_{(L - 1) i} - β_{L} G_{L i}^{s i m (e)}∥}^{2}}{2 σ^{2}}))) .

Similar to the previous step, the half-quadratic method is employed to adjust the output parameter. Based on Proposition 1, we obtain

u_{L}^{β} (γ, β_{L}) = max_{β_{L}, γ} (\sum_{i = 1}^{N} (γ_{i} \frac{{∥E_{(L - 1) i} - β_{L} G_{L i}^{sim (e)}∥}^{2}}{2 σ^{2}} - Φ (γ_{i}))) .

The local solution of the above optimization problem is adjusted using the following iterative process:

\{\begin{matrix} γ_{i}^{t + 1} = - k_{σ} (E_{L i} - β_{L}^{t} G_{L i}^{sim (e)}) \\ β_{L}^{t + 1} = {arg}_{β_{L}} max (\sum_{i = 1}^{N} (γ_{i}^{t + 1} \frac{∥ E_{L i} - β_{L}^{t} G_{L i}^{s i m} (e)}{2 σ^{2}})) \end{matrix} .

i.e., the following optimization problem is required to be solved in each iteration:

\begin{matrix} V_{L}^{β} (γ, β_{L}) = max_{β_{L}} (\sum_{i = 1}^{N} (γ_{i}^{t + 1} \frac{{∥E_{L i} - β_{L} G_{L i}^{sim (e)}∥}^{2}}{2 σ^{2}})), \\ ν_{L}^{β} (γ, β_{L}) = max_{β_{L}} (\frac{1}{2 σ^{2}} ((E_{L} - β_{L} G_{L}^{sim (e)}) Θ {(E_{L} - β_{L} G_{L}^{sim (e)})}^{T})) . \end{matrix}

where

Θ

is a diagonal matrix with

Θ_{i i} = γ_{i}, i = 1, \dots, N

.

V_{L}^{β} (γ, β_{L}) = max (\frac{1}{2 σ^{2}} (E_{L} Θ E_{L}^{T} + β_{L}^{2} G_{L}^{sim (e)} Θ G_{L}^{sim {(e)}^{T}} - β_{L} E_{L} Θ G_{L}^{sim {(e)}^{T}} - β_{L} G_{L}^{sim (e)} Θ E_{L}^{T})) .

The optimum output weight is adjusted by differentiating

V_{L}^{β} (γ, β_{L})

with respect to

β_{L}

as

\begin{matrix} (2 β_{L} G_{L}^{sim (e)} Θ G_{L}^{sim {(e)}^{T}} - E_{L} Θ G_{L}^{sim {(e)}^{T}} - G_{L}^{sim (e)} Θ E_{L}^{T}) = 0, \\ β_{L} = \frac{E_{L} Θ G_{L}^{sim {(e)}^{T}}}{G_{L}^{sim (e)} Θ G_{L}^{sim {(e)}^{T}}} . \end{matrix}

Finally, the output weight is adjusted by the following iterative process:

\{\begin{matrix} γ_{i}^{t + 1} = - k_{σ} (E_{L i} - β_{L}^{t} G_{L i}^{s i m (e)}) \\ β_{L}^{t + 1} = \frac{E_{L} Θ G_{L}^{sim {(e)}^{T}}}{G_{L}^{sim (e)} Θ G_{L}^{sim {(e)}^{T}}} \end{matrix} .

In these two phases, the parameters of the new node (L-th added node where

L \in N

) are tuned and then fixed. This process is iterated for each new node until the predefined condition is satisfied. The following proposition demonstrates that for each node, the algorithm converges.

Proposition 2.

The sequences

\{⊓_{L}^{g} (α^{t}, W_{L}^{t}), t = 1, 2, \dots\}

and

\{U_{L}^{β} (γ^{t}, β_{L}^{t}), t = 1, 2, \dots\}

converge.

Proof.

From Theorem 5 and Proposition 1, we have

u_{L}^{β} (γ^{t}, β_{L}^{t}) \leq u_{L}^{β} (γ^{t + 1}, β_{L}^{t})

\leq u_{L}^{β} (γ^{t + 1}, β_{L}^{t + 1})

and

(⊓_{L}^{g} (α^{t}, W_{L}^{t}) \leq u_{L}^{g} (α^{t + 1}, W_{L}^{t}) \leq

u_{L}^{g} (α^{t + 1}, W_{L}^{t + 1}))

. Thus, the non-decreasing sequence

\{u_{L}^{β} (γ^{t}, β_{L}^{t}), t = 1, 2, \dots\} (\{u_{L}^{g} (α^{t}, W_{L}^{t}), t = 1, 2, \dots\})

converges since the correntropy is upper bounded. □

Proposition 3.

When

Θ = I

, the output weight that is adjusted by the correntropy criterion is equivalent to the output weight that is adjusted by the MSE-based method such as IELM.

Proof.

Suppose that

Θ = I

, by

β_{L} = \frac{E_{b} θ G_{L}^{sim {(e)}^{T}}}{G_{L}^{sim (e)} Θ G_{L}^{sim {(e)}^{T}}}

, we have

β_{L} = \frac{E_{L} G_{L}^{sim {(e)}^{T}}}{G_{L}^{sim (e)} G_{L}^{sim {(e)}^{T}}} .

□

The training process of the proposed method is summarized in the following Algorithm 1 (C2N2).

Algorithm 1 C2N2

Input: training samples $χ = {x_{i}, y_{i}}_{i = 1}^{N}$
Output: Optimal input and output weights $β_{L}, W_{L}, L = 1, . . ., L_{m a x}$
$I n i t i a l i z a t i o n$ : Maximum number of hidden nodes $L_{m a x}$ , regularization term $C^{'}$ , maximum input side and output side iterations $I T 1, I T 2$ , error $E_{0} = [y_{1}, . . . y_{N}]$ .
$For L = 1 : L_{m a x}$
$S t e p 1$ : Calculate $H_{L}$ and X
$For k = 1 : I T 1$
Update input parameters
$End$
$S t e p 2$ : Calculate the hidden node vector, $G_{L}^{s i m (e)}$ by previous step
$For k = 1 : I T 2$
Update output weight
$End$
Update error as $E_{L} = E_{L - 1} - β_{L}^{s i m (e)} G_{L}^{s i m (e)}$
$End$

Remark 1.

The auxiliary variables

γ_{i}

and

α_{i}, i = 1, \dots, N

are utilized to reduce the effect of noisy data. For the samples with a high amount of error, these variables are very small; thus, these samples have slight effects on the optimization of the parameters of the network, which results in a more robust network.

5. Experimental Results

This section compares C2N2 with RBF, CCN and other constructive networks that are presented in [3]. The networks, whose hidden nodes’ input parameters are trained by the objective functions

V_{1}, V_{2}, V_{3}, \sqrt{V_{1}}, \sqrt{V_{2}}, \sqrt{V_{3}}

that are introduced in [3], are denoted by

N_{1}, \dots, N_{6}

. In addition to the mentioned methods, the proposed method is compared to the state-of-the-art constructive networks such as the orthogonal least square cascade network (OLSCN) [4] and the one hidden layer constructive network (OHLCN) introduced in [5]. Moreover, C2N2 is also compared with state-of-the-art robust learning methods including Multi-Layer Perceptron based on MCC (MLPMCC) [17], Multi-Layer Perceptron based on Minimum Error Entropy (MLPMEE) [17] and Robust Least Square Support Vector Machine (RLS-SVM) [35] and the recent work, CCOEN [28].

The rest of this section is organized as follows. Section 5.1 describes a framework for the experiments. The presented theorem and hyperparameters

(L, C^{'}

and

η^{'})

are investigated in Section 5.2. In Section 5.3, the presented method is compared to N1-N6, CCN, RBF and some state-of-the-art constructive networks including OHLCN and OLSCN. Experiments are performed on several synthetic and benchmark datasets that are contaminated with impulsive noise (one of the most popular types of non-Gaussian noise). In this part, experiments are also performed in the absence of impulsive noise. Section 5.4 compares the proposed method with state-of-the-art robust learning methods including MLPMEE, MLPMCC, RLS-SVM and CCOEN on various types of datasets.

5.1. Framework for Experiments

This part presents a framework for the experiments. The framework includes the type of activation function for C2N2 and other mentioned methods, type of kernel, kernel parameters

(σ)

, range of hyperparameters

(L, C^{'}, λ^{'})

and dataset specification.

5.1.1. Activation Function and Kernel

For the proposed method, the tangent hyperbolic activation function is used. It is represented as follows (see Figure 3):

tanh (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}

Figure 3. Tangent hyperbolic function.

For networks

N_{1} - N_{6}

, CCN, OLSCN, OHLCN, MLPMCC and MLPMEE, the sigmoid activation function is used. This function is represented and displayed as follows (see Figure 4):

Figure 4. Sigmoid function.

For the proposed method and RLS-SVM, RBF kernel is used. It is shown as

K (X, Y) = exp (- \frac{{∥ X - Y ∥}^{2}}{2 σ^{2}})

In the experiments, the optimum kernel parameter

(σ)

is selected from the set

{0.1, 0.5, 1, 10, 15} .

5.1.2. Hyperparameters

The method has three hyperparameters. These parameters help to avoid over or underfitting, which improves performance. The first parameter is a number of hidden nodes

(L)

. The optimum number of hidden nodes is selected from the set

{1, \dots, 8}

. Due to the boundedness of tanh function

(- 1 < tanh (x) < 1)

, the error signal must be scaled to this range. Thus,

λ^{'}

should be selected from the set

{- 0.9, \dots, - 0.1, 0.1, 0.2, \dots, 0.9}

. Figure 5 shows that accuracy is symmetric with respect to

λ^{'}

. Thus,

λ^{'}

should be selected from the set

{0.1, \dots, 0.9}

. The possible range for

C^{'}

is investigated in the next part.

Figure 5. Effect of

λ^{'}

on accuracy. Parameter

C^{'}

is set to

0.45

. The experiment is performed on the diabetes dataset with the network with only one hidden node.

5.1.3. Data Normalization

In this paper, the input vector of data samples is normalized into the range

[- 1, 1]

. For regression datasets, their targets are normalized into the range

[0, 1]

.

In this paper, most of the datasets are taken from the UCI Machine Learning Repository [36] and Statlib [37]. These datasets are specified in Table 1 and Table 2.

Table 1. Specification of the regression problem.

Table 2. Specification of the classification problem.

5.2. Convergence

This part investigates the convergence of the proposed method (Theorem 6), followed by an investigation of the hyperparameters.

5.2.1. Investigation of Theorem 6

The main goal of this paper is to maximize the correntropy of the error function. Regarding the kernel definition and due to maximization of correntropy, the approximator (output of the neural network,

f_{L}

) has the most similarity to the target function

(f)

. Theorem 6 proves that the proposed method obtains the optimal solution, i.e., the correntropy of the error function is maximized. This part investigates the convergence of the proposed method. In this experiment, the kernel parameter is set to 10; thus, the optimum value for correntropy is

v (e = 0) = V^{max} = 0.0399

. Figure 6 shows the convergence of C2N2 to the optimum value in the approximation of the sinc function.

Figure 6. Convergence of the proposed method in the approximation of Sinc function when

σ = 10

. It converges to

V^{max} = 0.0399

.

5.2.2. Hyperparameter Evaluation

To evaluate parameters

C^{'}

and

λ^{'}

, C2N2 with only one hidden node was experimented on using the diabetes dataset. Figure 7 shows that the best amount for parameter

C^{'}

is in the range

[0, 1]

. Thus, in the experiments, parameter

C^{'}

is selected from the set

\{0.05, 0.15, 0.25, \dots, 0.95\}

.

Figure 7. Effect of hyperparameters

(C^{'}, λ^{'})

on accuracy.

5.3. Comparison

This part compares the proposed method with the networks

N_{1}, \dots, N_{6}

, CCN, OHLCN, OLSCN, and RBF in the presence and absence of non-Gaussian noise. One of the worst types of non-Gaussian noise is impulsive noise.

For

λ^{'}

= 0, the accuracy is set to zero. This type of noise adversely affects the performance of MSE-based methods such as the networks

N_{1} - N_{6}

, CCN, OHLCN and OLSCN. In this part and part D, we perform experiments similar to [4]. We calculated the RMSE (classification accuracy) on the testing dataset after each hidden unit was added and reported the lowest (highest) RMSE (accuracy) along with the corresponding network size. Similar to [4], experiments were carried out in 20 trials and the results (RMSE (accuracy) and number of nodes) averaged over 20 trials are listed in Table 3, Table 4, Table 5 and Table 6.

Table 3. Performance comparison of C2N2 and the networks

N_{4}, N_{5}, N_{6}

, CCN and RBF: benchmark regression dataset.

Table 4. Performance comparison of RBF,

N_{4}, N_{5}, N_{6}

, CCN and C2N2: classification datasets.

Table 5. Performance comparison of C2N2 and the state-of-the-art constructive networks OLSCN and OHLCN: benchmark regression dataset.

Table 6. Performance comparison of C2N2 AND the state-of-the-art constructive networks OLSCN and OHLCN: benchmark classification dataset.

In all result tables, the best results are shown in bold and underlined. The results that are close to the best ones are in bold.

5.3.1. Synthetic Dataset (Sinc Function)

Figure 8 compares C2N2 with RBF and the network N1 in the approximation of the Sinc function. In Figure 9, the experiment is performed in the presence of impulsive noise.

Figure 8. Comparison of C2N2 with RBF and the network

N_{1}

. The experiment is performed on the approximation of the Sinc function.

Figure 9. Comparison of C2N2 with RBF and the network

N_{1}

. The experiment is performed on the approximation of the Sinc function and in the presence of impulsive noise.

5.3.2. Other Synthetic Dataset

The following regression problems are used to evaluate the performance of the proposed method in comparison to the networks

N_{1}, N_{2}

and

N_{3}

.

\begin{matrix} f^{(1)} (x_{1}, x_{2}) = 10.391 ((x_{1} - 0.4) (x_{2} - 0.6) + 0.36) \\ f^{(2)} (x_{1}, x_{2}) = 24.234 r^{2} (0.75 - r^{2}) \end{matrix}

where

\begin{matrix} r = {(x_{1} - 0.5)}^{2} + {(x_{2} - 0.5)}^{2} \\ f^{(3)} (x_{1}, x_{2}) = & 42.659 (0.1 + r_{1} (0.05 + r_{1}^{4} - 10 r_{1}^{2} r_{2}^{2} + 5 r_{2}^{4})) \end{matrix}

r_{1} = x_{1} - 0.5 and r_{2} = x_{2} - 0

f^{(4)} (x_{1}, x_{2}) =

\begin{matrix} 1.3365 (1.5 (1 - x_{1}) + e^{2 x_{1} - 1} sin (3 π {(x_{1} - 0.6)}^{2}) + e^{3} (x_{2} - 0.5) + sin (4 π {(x_{2} - 0.9)}^{2})) \end{matrix}

For each of the above functions, 225 pairs

(x_{1}, x_{2})

are generated randomly in the interval

[0, 1]

. For each piece of training data, its target is assigned as

\begin{matrix} y_{j i} = f_{l}^{(j)} (x_{i 1}, x_{i 2}) = f^{(j)} (x_{i 1}, x_{i 2}) + η_{i l} \\ i = 1, \dots, 225 and j = 1, \dots, 4 and l = 0, 1 \end{matrix}

where

η_{i}

is noise that is added to the target of the data samples. In this section, the index of the noise

(l)

is:

\{\begin{matrix} 0 . without noise \\ 1 . η_{i 1} = impulse noise \end{matrix}

For impulse noise, the outputs of five data samples are changed by extra high values using the uniform distribution.

5.4. Discussion

5.4.1. Discussion on Table 7

Table 7 compares C2N2 with the networks

N_{1}, N_{2}

and

N_{3}

. From the table, we can see that in the absence of noise, the proposed method outperforms the other methods on the datasets

f_{0}^{(2)}, f_{0}^{(3)}

and

f_{0}^{(4)}

. For the dataset

f_{0}^{(1)}

, the best result is for the network

N_{2}

. We added impulsive noise to the dataset and again performed experiments. From Table 7, we can see that the proposed method is more stable when data are contaminated with impulsive noise. For example, for the dataset

f_{0}^{(2)}

, the RMSEs of C2N2 and

N_{1}

to

N_{3}

are close. However, in the presence of impulsive noise for the dataset

f_{1}^{(2)}

, the RMSEs for C2N2 and

N_{1}

to

N_{3}

are 0.2487, 0.3676, 0.2790, and 0.3226 respectively. This means that noisy data samples have less effect on the proposed method in comparison to the other mentioned methods and the proposed method is more stable. The goal of any learning method is to increase performance. Thus, in this paper, we focus on RMSE. However, from the architecture viewpoint, the proposed method tends to have a smaller number of nodes in

50 %

of the datasets.

Table 7. Performance comparison of C2N2 and the networks

N_{1}, N_{2}

and

N_{3}

: synthetic regression dataset.

Table 7. Performance comparison of C2N2 and the networks

N_{1}, N_{2}

and

N_{3}

: synthetic regression dataset.

Datasets	C2N2			$N_{1}$			$N_{2}$			$N_{3}$
Datasets	Testing RMSE	#N	Time (s)	Testing RMSE	#N	Time (s)	Testing RMSE	#N	Time (s)	Testing RMSE	#N	Time (s)
$f_{0}^{(1)}$	0.1358	$\underset{̲}{3.70}$	0.69	0.1555	4.85	1.48	$\underset{̲}{0.1284}$	6.30	1.78	0.1536	7.05	0.81
$f_{1}^{(1)}$	$\underset{̲}{0.1587}$	6.25	0.91	0.3244	$3.70$	0.44	0.2460	$\underset{̲}{3.20}$	0.72	0.2750	4.40	2.40
$f_{0}^{(2)}$	$0.1963$	$\underset{̲}{2.70}$	1.09	$0.1953$	5.90	1.91	$0.1953$	4.55	1.57	$0.1954$	5.45	2.18
$f_{1}^{(2)}$	$\underset{̲}{0.2487}$	$3.55$	3.98	0.3676	5.25	1.08	0.2790	$\underset{̲}{3.25}$	0.89	0.3226	5.65	1.16
$f_{0}^{(3)}$	$\underset{̲}{0.0892}$	$\underset{̲}{3.15}$	1.05	0.0940	6.80	0.12	0.0943	$3.30$	1.34	0.0935	6.40	0.34
$f_{1}^{(3)}$	$\underset{̲}{0.1416}$	4.15	1.12	0.2949	4.15	0.64	0.2309	$\underset{̲}{2.35}$	0.51	0.2555	4.35	0.57
$f_{0}^{(4)}$	$0.1136$	$\underset{̲}{2.10}$	1.81	$0.1141$	4.55	1.34	$0.1139$	8.10	2.02	0.1963	4.70	0.87
$f_{1}^{(4)}$	$\underset{̲}{0.1360}$	7.70	3.01	0.2928	$\underset{̲}{1.65}$	1.54	0.2569	4.20	0.98	0.2961	4.85	0.45

5.4.2. Why C2N2 Denies Impulse Noises?

Regarding optimization problems, for noisy (impulsive noise) data, auxiliary variables

α_{i}^{t}, γ_{i}^{t}

are low. Thus, such data have a small role in optimization problems; parameters of the new node are obtained based on noise-free data (Remark 1). Table 8 shows the auxiliary variables for several noisy and noise-free data samples.

Table 8. Amount of auxiliary variables for noisy and noise-free data. The experiment is performed on the

f^{1}

dataset.

5.4.3. Benchmark Dataset

In this part, several regression and classification datasets are contaminated with impulse noise. At this time, as in [38], we produce impulsive noise by generating random real numbers from the following distribution function, and then we add them to data samples:

η = 0.95 N (0, 10^{- 4}) + 0.05 N (0, 10)

where

N (μ, σ)

is a Gaussian distribution function with the mean

μ

and variance

σ

. For the regression dataset, we add noise to its target. For the classification dataset, we add noise to its input feature vector. Experiments on these datasets confirm the robustness of the proposed method in comparison with

N_{4}, \dots, N_{6}

, CCN and RBF, OLSCN and OHLCN.

5.4.4. Discussion on Table 3

Table 3 compares C2N2 with the networks

N_{4}, N_{5}, N_{6}

, CCN and RBF on the Autoprice, Baloon and Pyrim datasets in the presence and absence of impulsive noise. It shows that C2N2 is more stable than the networks N4 to N6, CCN and RBF in the presence of non-Gaussian noise. For example, for the Autoprice dataset, RMSEs for C2N2, N4-N6, CCN and RBF are

0.2689, 0.2996, 0.2689, 0.3681, 0.2775

and

0.2725

, respectively. After adding noise, RMSEs are

0.2770, 0.5521, 0.3610, 0.4295, 0.4768

and

0.9082

respectively. These results confirm that the proposed method is robust to impulsive noise in comparison to the mentioned methods.

5.4.5. Discussion on Table 4

This table compares C2N2 with the networks

N_{4}, N_{5}, N_{6}

, CCN and RBF on the Ionosphere, Colon, Leukemia and Dimdata datasets in the presence of impulsive noise. It shows that C2N2 outperforms the other mentioned methods on the Ionosphere, Colon and Leukemia datasets. For the dataset Dimdata, RBF outperforms the proposed method. However, the result is obtained with 1000 nodes for

RBF

and with an average of

3.5

nodes for C2N2.

5.4.6. Discussion on Table 5 and Table 6

These tables compare C2N2 with state-of-the-art constructive networks on the regression and classification datasets. They show that the best results are for C2N2. For the Housing dataset, RMSEs for C2N2, OLSCN and OHLCN are

0.0966, 0.0988

and

0.0993

, respectively. In the presence of impulsive noise, RMSEs are

0.0978, 0.2411

and

0.1824

. Thus, the proposed method has the best stability among the other mentioned state-of-the-art constructive networks when data samples are contaminated with impulsive noise. From the architecture viewpoint, OLSCN has the most compact architecture. However, it has the worst training time. Table 6 shows that C2N2 outperforms OLSCN and OHLCN in the presence of impulsive noise in the classification dataset.

5.4.7. Computational Complexity

Let L be the maximum number of hidden units to be added to the network. For each newly added node, its input parameter is adjusted as specified in Section 4.4.1. The order of adjusting the inverse of the

d \times d

matrix is

d^{3}

. Thus,

h =

O (L k (min (d^{3}, N^{3}) + N^{2}))

, where k is the constant term (number of iterations). Thus, we have

h = O (L (min (d^{3}, N^{3}) + N^{2})) .

5.5. Comparison

This part compares C2N2 with state-of-the-art robust learning methods on several benchmark datasets. These methods are Robust Least Square SVM (RLS-SVM), MLPMEE and MLPMCC. As mentioned in Section 5.4, and similar to [4], in this part, experiments are performed in 20 trials and average results of RMSE (accuracy for classification) and the number of hidden nodes(#N) are reported in Table 9.

Table 9. Performance comparison of C2N2 and the state-of-the-art robust methods MLPMEE, MLPMCC and RLS-LSVM: benchmark regression dataset.

5.5.1. Discussion on Table 9

From the table, we can see that for Pyrim, Prim (noise) and baskball (noise), C2N2 absolutely outperforms the other robust methods in terms of RMSE. For Bodyfat and Bodyfat (noise), C2N2 slightly outperforms other methods in terms of RMSE. Thus, to compare them, we need to check the number of nodes and training times. For both datasets, C2N2 has a fewer number of nodes. In the presence of noise, C2N2 has a better training time in comparison to RLS-LSVM.

Thus, the proposed method outperforms the other methods for these two datasets. Among these six datasets, RLS-SVM only outperforms C2N2 in one dataset, i.e., Baskball; however, it has a worse training time and more nodes. It can be seen that among the robust methods, the proposed method has the most compact architecture.

5.5.2. Discussion on Table 10

This table compares the recent work, CCOEN, with the proposed method on three datasets in the presence and absence of noise. According to the table, the proposed method outperforms CCOEN in all cases except the Cloud dataset in the presence of noise where CCOEN has a slightly better performance with more hidden nodes. From the architecture viewpoint, the proposed method has a fewer number of nodes in comparison to CCOEN in most cases. Therefore, correntropy with the Gaussian kernel provides better results in comparison to the sigmoid kernel.

Table 10. Performance comparison of C2N2 and the recent work. CCOEN: benchmark regression dataset.

Dataset	C2N2		CCOEN
Dataset	Testing RMSE	#Nodes	Testing RMSE	#Nodes
Abalone	$\underset{̲}{0.075}$	$\underset{̲}{6}$	0.090	8.8
Abalone (Noise)	$\underset{̲}{0.079}$	$\underset{̲}{2.6}$	$0.091$	7.8
Cleveland	$\underset{̲}{0.061}$	$\underset{̲}{4.2}$	0.791	6.1
Cleveland (Noise)	$\underset{̲}{0.066}$	$\underset{̲}{2}$	0.821	8.5
Cloud	$\underset{̲}{0.277}$	$\underset{̲}{4.2}$	0.293	4.7
Cloud (noise)	0.302	5.6	$\underset{̲}{0.290}$	$\underset{̲}{4.8}$

6. Conclusions

In this paper, a new constructive feedforward network is presented that is robust to non-Gaussian noises. Most of the other existing constructive networks are trained based on the mean square error (MSE) objective function and consequently act weak in the presence of non-Gaussian noises, especially impulsive noise. Correntropy is a local similarity measure of two random variables and is successfully used as the objective function for the training of adaptive systems such as adaptive filters. In this paper, this objective function with a Gaussian kernel is utilized to adjust the input and output parameters of the newly added node in a constructive network. It is proved that the new node obtained from the input side optimization is not orthogonal to the residual error of the network. Regarding this fact, correntropy of the residual error converges to its optimum when the error and the activation function are continuous random variables in

L^{2} (Ω, F, P)

space where the triple

(Ω, F, P)

is considered as a probability space. During the training on datasets, the function form of error is not available; thus, we provide a method to adjust the input and output parameters of the new node from training data samples. The auxiliary variables that appear in input and output side optimization problems decrease the effect of the non-Gaussian noises. For example, for impulsive noise, these variables are close to zero; thus, these data samples have little role in optimizing the parameters of the network. For the MSE-based constructive networks, the data samples that are contaminated by impulsive noises have a great role in optimizing the parameters of the network, and consequently, the network is not robust. The experiments are performed on some synthetic and benchmark datasets. For the synthetic datasets, the experiments are performed in the presence and absence of impulsive noises. We saw that for the datasets that are contaminated by impulsive noises, the proposed method has significantly better performance than the state-of-the-art MSE-based constructive network. For the other synthetic and benchmark datasets, in most cases, the proposed method has satisfactory performance in comparison to the MSE-based constructive network and radial basis function (RBF). Furthermore, C2N2 was compared with state-of-the-art robust learning methods such as MLPMEE, MLPMCC and the robust version of the Least Square Support Vector Machine and CCOEN. The performances are obtained with compact architectures due to the input parameters being optimized. We also see that correntropy with Gaussian kernel provides better results in comparison to the correntropy with sigmoid kernel.

The use of the correntropy-based function introduced in this research may also benefit networks with other architectures toward enhancing the generalization performance and robustness level. In the context of further research, the validity of similar results can be verified for various classes of neural networks. In addition, since impulsive noise is one of the worst cases of non-Gaussian noise, it can be expected that a different non-Gaussian noise will yield a result between clean data and data with impulsive noise. This should be verified in further experiments.

It is also necessary to point out here other novel modern avenues and similar research directions. For example, ref. [39] delves into modal regression, presenting a statistical learning perspective that could enrich the discussion on learning algorithms and their efficiency in different noise conditions. In particular, it points out that correntropy-based regression can be interpreted from a modal regression viewpoint when the tuning parameter goes to zero. At the same time, [40] depicts a big picture of correntropy-based regression by showing that with different choices of the tuning parameter, correntropy-based regression learns a location function.

Correntropy not only has inferential properties that could be used for neural network analysis, but another approach could be, for example, cross-sample entropy-based techniques. One such direction was shown to be effective in [41] with reported results of simulation on exchange market datasets.

Finally, it is also worth mentioning that the choice of the algorithm applied for optimizing the objective functions can influence the results. The usage of non-smooth methodology focusing on bundle-based algorithms [42] as a possible efficient tool in machine learning and neural network analysis can also be tested.

Author Contributions

All authors contributed to the paper. The experimental part was mainly done by the first author, M.N. Conceptualization was done by M.R. and H.S.Y. Verification and final editing of the manuscript was done by Y.N., A.M. and M.M.M. who played a role of a research director. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data and codes can be requested from the first author.

Acknowledgments

The authors would like to express their thanks to the GA. Hodtani and Ehsan Shams Davodly for their constructive remarks and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

C2N2	Correntropy-based constructive neural network
MCC	Maximum correntropy criterion
MSE	Mean square error
MEE	Minimum Error Entropy
EMSE	Excess mean square error
OLS	Orthogonal least square
CCOEN	Cascade correntropy network
MLPMEE	Multi-Layer Perceptron based on Minimum Error Entropy
MLPMCC	Multi-Layer Perceptron based on correntropy
RLS-SVM	Robust Least Square Support Vector Machine
FFN	Feedforward network
RBF	Radial basis function
ITL	Information theoretic learning
CGN	Cascade correlation network
OHLCN	One hidden layer constructive adaptive neural network

Appendix A

Appendix A.1. Proof of Lemma 1

Proof.

Similar to [28], let

k_{L} = \frac{∥e_{L - 1}∥}{∥g_{L}^{∥}∥} γ, γ \in R - {0}

where

\forall g_{L} \in G |cos (θ_{e_{L - 1}}, g_{L}^{*})| \geq |cos (θ_{e_{L - 1}}, g_{L})| .

Thus,

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}∥}^{2}}{2 σ^{2}}) =

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} + {∥e_{L - 1}∥}^{2} γ^{2} \frac{∥ ∥_{L} ∥^{2}}{{∥g_{I_{L}}∥}^{2}} - 2 γ ∥e_{L - 1}∥ \frac{∥g_{g^{2}}∥}{∥ ∥_{L}^{L} ∥} cos (θ_{e_{L - 1}}, g_{L})}{2 σ^{2}}) .

Let

A = \frac{∥g_{L}∥}{∥g_{L}^{L}∥}

, we have,

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}∥}^{2}}{2 σ^{2}}) =

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + A^{2} γ^{2} - 2 γ A cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}) .

We need to prove that

k_{L} \in R - {0}

exists such that

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + γ^{2} - 2 γ cos (θ_{e_{L - 1}}, g_{L}^{*}))}{2 σ^{2}}) \geq

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + A^{2} γ^{2} - 2 γ A cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}), \forall g_{L} \in G .

Suppose that the inequality above does not hold.

Two possible conditions may happen:

$cos (θ_{e_{L - 1}}, g_{L}^{*}) \geq 0$

1.1. If

γ > 0

, we have

\forall γ > 0 2 γ (cos (θ_{e_{L - 1}}, g_{L}^{*}) - A cos (θ_{e_{L - 1}}, g_{L}^{*}))

\leq 2 γ (cos (θ_{e_{L - 1}}, g_{L}^{*}) - A cos (θ_{e_{L - 1}}, g_{L})) \leq γ^{2} (1 - A^{2}), \forall g_{L} \in G

\to \forall γ > 0 2 γ cos (θ_{e_{L - 1}}, g_{L}^{*}) (1 - A) \leq γ^{2} (1 - A^{2}), \forall g_{L} \in G .

Again, there are two possible conditions: First,

(1 - A) \geq 0

. Then

\to \forall γ > 0 2 cos (θ_{e_{L - 1}}, g_{L}^{*}) \leq γ (1 + A) \forall g_{L} \in G

.

\to \forall γ > 0 \frac{2 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \leq γ, \forall g_{L} \in G

.

Let

γ = \frac{cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \to cos (θ_{e_{L - 1}}, g_{L}^{*}) \leq 0, \forall g_{L} \in G

. From the assumption and the above inequality, we have

cos (θ_{e_{L - 1}}, g_{L}^{*}) = 0

.

This is contradicted by the fact that

span (G)

is dense in

L^{2} (Ω, F, P)

. Second:

(1 - A) \leq 0

.

\to \forall γ > 0 2 cos (θ_{e_{L - 1}}, g_{L}^{*}) \geq γ (1 + A) \forall g_{L} \in G

.

\to \forall γ > 0 \frac{2 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \geq γ \forall g_{L} \in G

.

Let

γ = \frac{3 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \to cos (θ_{e_{L - 1}}, g_{L}^{*}) \leq 0 \forall g_{L} \in G

.

From the assumption above, we have

cos (θ_{e_{L - 1}}, g_{L}^{*}) = 0

.

This is contradicted by the fact that

span (G)

is dense in

L^{2} (Ω, F, P)

.

2.: $cos (θ_{e_{L - 1}}, g_{L}^{*}) \leq 0 .$

2.1. If

γ < 0

, we have

\forall γ < 0 2 γ (cos (θ_{e_{L - 1}}, g_{L}^{*}) - A cos (θ_{e_{L - 1}}, g_{L}^{*})) \leq 2 γ (cos (θ_{e_{L - 1}}, g_{L}^{*}) - A cos (θ_{e_{L - 1}}, g_{L}))

\leq γ^{2} (1 - A^{2}), \forall g_{L} \in G .

\to \forall γ < 0 2 γ cos (θ_{e_{L - 1}}, g_{L}^{*}) (1 - A) \leq γ^{2} (1 - A^{2}) \forall g_{L} \in G

\to \forall γ < 0 2 cos (θ_{e_{L - 1}}, g_{L}^{*}) (1 - A) \leq γ (1 - A^{2}) \forall g_{L} \in G .

Again, there are two possible conditions: First:

(1 - A) \geq 0

. Then

\forall γ < 0 2 cos (θ_{e_{L - 1}}, g_{L}^{*}) \leq γ (1 + A) \forall g_{L} \in G

,

\forall γ < 0 \frac{2 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \leq γ \forall g_{L} \in G

.

Let

γ = \frac{3 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \to cos (θ_{e_{L - 1}}, g_{L}^{*}) \geq 0, \forall g_{L} \in G

.

\in G

cos (θ_{e_{L - 1}}, g_{L}^{*}) = 0

.

From the assumption, we have

cos (θ_{e_{L - 1}}, g_{L}^{*}) = 0

.

This is contradicted by the fact that

span (G)

is dense in

L^{2} (Ω, F, P)

. Second:

(1 - A) \leq 0

\forall γ < 0 2 cos (θ_{e_{L - 1}}, g_{L}^{*}) \geq γ (1 + A) \forall g_{L} \in G

.

\forall γ < 0 \frac{2 cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \geq γ \forall g_{L} \in G

.

Let

γ = \frac{cos (θ_{e_{L - 1}}, g_{L}^{*})}{(1 + A)} \to cos (θ_{e_{L - 1}}, g_{L}^{*}) \geq 0, \forall g_{L} \in G

.

From the assumption above, we have,

cos (θ_{e_{L - 1}}, g_{L}^{*}) = 0

, and this is contradicted by the fact that

span (G)

is dense in

L^{2} (Ω, F, P)

. Based on the above arguments, a real number

γ \neq 0

exists such that the following inequality holds:

\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + γ^{2} - 2 γ cos (θ_{e_{L - 1}}, g_{L}^{*}))}{2 σ^{2}})

\geq \frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + A^{2} γ^{2} - 2 γ A cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}), \forall g_{L} \in G .

Thus, from Theorem 4, we have

E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + γ^{2} - 2 γ cos (θ_{e_{L - 1}}, g_{L}^{*}))}{2 σ^{2}})) \geq

E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} (1 + A^{2} γ^{2} - 2 γ A cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}})), \forall g_{L} \in G .

Thus, there exists a real number

k_{L} \in R - {0}

such that

E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}^{*}∥}^{2}}{2 σ^{2}})) \geq E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L} g_{L}∥}^{2}}{2 σ^{2}})), \forall g_{L} \in G,

\to V (g_{L}^{*}) \geq V (g_{L}), \forall g_{L} \in G .

This means that

g_{L}^{sim (e)} = g_{L}^{*}

and this completes the proof. □

Appendix A.2. Proof of Theorem 6

Proof.

Inspired by [3], the proof of this theorem is divided into two parts: First, we prove that the correntropy of the network strictly increases after adding each hidden node, and then we prove that the supremum of the correntropy of the network is

V_{max} = \frac{1}{\sqrt{2 π} σ}

.

Step 1: The correntropies of an SLFN with

L - 1

and L hidden nodes are:

V (e_{L - 1}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2}}{2 σ^{2}})),

V (e_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L}∥}^{2}}{2 σ^{2}}))

= E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - β_{L}^{s i m (e)} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}})),

respectively.

In the following, it is proved that

\{k_{L}, g_{L}^{e}\}

exists such that

V (e_{L}) > V (e_{L - 1}) .

Let

k_{L}^{'} = η ∥e_{L - 1}∥, η \in R - {0},

then we have

V (g_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{∥e_{L - 1} - η∥ e_{L - 1} {∥g_{L}∥}^{2}}{2 σ^{2}})) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{U}{2 σ^{2}})),

where

U = {∥e_{L - 1}∥}^{2} + η^{2} {∥e_{L - 1}∥}^{2} {∥g_{L}∥}^{2} - 2 η {∥e_{L - 1}∥}^{2} ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L})

Thus,

V (g_{L}) = E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2} ((1 + η^{2} {∥g_{L}∥}^{2}) - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}})) .

We need to prove that

g_{L} \in G

and

η \in R - {0}

exist such that

V (g_{L}) > V (e_{L - 1}) .

Suppose that there are no

η \in R - {0}

and

g_{L} \in G

such that

exp (- \frac{{∥e_{L - 1}∥}^{2} ((1 + η^{2} {∥g_{L}∥}^{2}) - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}) \leq exp (- \frac{{∥e_{L - 1}∥}^{2}}{2 σ^{2}}),

\to exp (- \frac{{∥e_{L - 1}∥}^{2}}{2 σ^{2}}) exp (\frac{{∥e_{L - 1}∥}^{2} ((1 + η^{2} {∥g_{L}∥}^{2}) - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}) \geq 1,

\forall g_{L} \in G

and

\forall η \in R - {0} .

Then the following inequality holds,

exp (\frac{{∥e_{L - 1}∥}^{2} ((η^{2} {∥g_{L}∥}^{2}) - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L}))}{2 σ^{2}}) \geq exp (0),

\forall g_{L} \in G \forall η \in R - {0}

.

\to {∥e_{L - 1}∥}^{2} ((η^{2} {∥g_{L}∥}^{2}) - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L})) \geq 0,

\forall g_{L} \in G

and

\forall η \in R - {0}

\to η^{2} {∥g_{L}∥}^{2} - 2 η ∥g_{L}∥ cos (θ_{e_{L - 1}}, g_{L}) \geq 0,

\forall g_{L} \in G

and

\forall η \in R - {0}

.

Let

η = \frac{cos (θ_{e_{L - 1}}, g_{L})}{∥g_{L}∥}, {cos}^{2} (θ_{e_{L - 1}}, g_{L}) - 2 {cos}^{2} (θ_{e_{L - 1}}, g_{L}) \geq 0 \forall g_{L} \in G,

\to {cos}^{2} (θ_{e_{L - 1}, g_{L}}) \leq 0 \forall g_{L} \in G .

Thus,

{cos}^{2} (θ_{e_{L - 1}}, g_{L}) = 0 \forall g_{L} \in G

. This is contradictory to

span (G)

being dense in

L^{2} (Ω, F, P)

; thus, we have

\exists g_{L} \in G, \exists k_{L}^{'} \in R - {0} \frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L}^{'} g_{L}∥}^{2}}{2 σ^{2}}) > \frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2}}{2 σ^{2}})

.

Based on Theorem 4, the following inequality holds with probability one:

\exists g_{L} \in G, \exists k_{L}^{'} \in R - {0}

E (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1} - k_{L}^{'} g_{L}∥}^{2}}{2 σ^{2}})) > (\frac{1}{\sqrt{2 π} σ} exp (- \frac{{∥e_{L - 1}∥}^{2}}{2 σ^{2}}))

, i.e.,

\exists g_{L} \in G, \exists k_{L}^{'} \in R - {0}, E (k (e_{L - 1} - k_{L}^{'} g_{L})) > E (k (e_{L - 1}))

almost surely.

Based on the above argument, with probability one, we have:

V (e_{L}) > V (e_{L - 1})

Based on Theorem 5, since the correntropy is strictly increasing, with probability one it converges to its supremum.

Step 2: We know that

V_{L}^{β} = E (exp (- \frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}})) =

E (sup_{α \in R^{-}} (α \frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}} - ϕ (α))),

and according to Proposition 1, we have

α = - G (\frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}}),

There is

V_{L m a x}^{β} = E ((α \frac{{∥e_{L - 1} - β_{L} g_{L}^{sim (e)}∥}^{2}}{2 σ^{2}} - ϕ (α))) .

Therefore, the optimum

β_{L}

is

β_{L} = \frac{E (γ e_{L - 1} g_{L})}{E (γ g_{L}^{2})} .

In the previous step, we showed that correntropy converges; the norm of error converges and

γ

converges to a constant term. In the case of constant

γ

, similar to [3], the error sequence constitutes a Cauchy sequence and because the mentioned probability space is complete,

\exists e^{*} \in L^{p} (Ω, F, P)

. Therefore,

\forall L > N, \forall g \in G

e_{L} \to e^{*},

lim_{L \to \infty} \frac{E {(e_{L - 1} g_{L})}^{2}}{E (g_{L}^{2})} = 0 .

Thus, similar to [3], we have

As we know,

E (e^{*} g) = 0, \forall g \in G

, and we have

lim_{L \to \infty} E (e_{L - 1} g) = 0, \forall g \in G

and

∥e^{*}∥ = 0 .

Based on Theorem 3 and

{lim}_{L \to \infty} ∥e_{L}∥ = 0

, we have

{lim}_{L \to \infty} E (k (e_{L})) = E ({lim}_{L \to \infty} (k (e_{L}))) = E (\frac{1}{\sqrt{2 π} σ} e^{- {lim}_{L \to \infty} {∥e_{L}∥}^{2}}) = E (k (0)) = \frac{1}{\sqrt{2 π} σ} = V_{m a x}

. Based on step 1 and step 2, we have

{lim}_{L \to \infty} V (e_{L}) = V_{\max}

almost surely.

This completes the proof. □

References

Erdogmus, D.; Principe, J.C. An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. Signal Process. IEEE Trans. 2002, 50, 1780–1786. [Google Scholar] [CrossRef]
Fahlman, S.E.; Lebiere, C. The cascade-correlation learning architecture. In Proceedings of the Advances in Neural Information Processing Systems 2, NIPS Conference, Denver, CO, USA, 27–30 November 1989; pp. 524–532. [Google Scholar]
Kwok, T.-Y.; Yeung, D.-Y. Objective functions for training new hidden units in constructive neural networks. Neural Netw. IEEE Trans. 1997, 8, 1131–1148. [Google Scholar] [CrossRef] [PubMed]
Huang, G.; Song, S.; Wu, C. Orthogonal least squares algorithm for training cascade neural networks. Circuits Syst. Regul. Pap. IEEE Trans. 2012, 59, 2629–2637. [Google Scholar] [CrossRef]
Ma, L.; Khorasani, K. New training strategies for constructive neural networks with application to regression problems. Neural Netw. 2004, 17, 589–609. [Google Scholar] [CrossRef] [PubMed]
Ma, L.; Khorasani, K. Constructive feedforward neural networks using Hermite polynomial activation functions. Neural Netw. IEEE Trans. 2005, 16, 821–833. [Google Scholar] [CrossRef] [PubMed]
Reed, R. Pruning algorithms-a survey. Neural Netw. IEEE Trans. 1993, 4, 740–747. [Google Scholar] [CrossRef] [PubMed]
Castellano, G.; Fanelli, A.M.; Pelillo, M. An iterative pruning algorithm for feedforward neural networks. Neural Netw. IEEE Trans. 1997, 8, 519–531. [Google Scholar] [CrossRef] [PubMed]
Engelbrecht, A.P. A new pruning heuristic based on variance analysis of sensitivity information. Neural Netw. IEEE Trans. 2001, 12, 1386–1399. [Google Scholar] [CrossRef]
Zeng, X.; Yeung, D.S. Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing 2006, 69, 825–837. [Google Scholar] [CrossRef]
Sakar, A.; Mammone, R.J. Growing and pruning neural tree networks. Comput. IEEE Trans. 1993, 42, 291–299. [Google Scholar] [CrossRef]
Huang, G.-B.; Saratchandran, P.; Sundararajan, N. A generalized growing and pruning RBF (GGAPRBF) neural network for function approximation. Neural Netw. IEEE Trans. 2005, 16, 57–67. [Google Scholar] [CrossRef]
Huang, G.-B.; Saratchandran, P.; Sundararajan, N. An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks. Syst. Man. Cybern. Part Cybern. IEEE Trans. 2004, 34, 2284–2292. [Google Scholar] [CrossRef] [PubMed]
Wu, X.; Rozycki, P.; Wilamowski, B.M. A Hybrid Constructive Algorithm for Single-Layer Feedforward Networks Learning. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 1659–1668. [Google Scholar] [CrossRef] [PubMed]
Santamaría, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. Signal Process. IEEE Trans. 2006, 54, 2187–2197. [Google Scholar] [CrossRef]
Liu, W.; Pokharel, P.P.; Príncipe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. Signal Process. IEEE Trans. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. Power Syst. IEEE Trans. 2009, 24, 1657–1666. [Google Scholar] [CrossRef]
Singh, A.; Principe, J.C. Using correntropy as a cost function in linear adaptive filters. In Proceedings of the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 2950–2955. [Google Scholar]
Shi, L.; Lin, Y. Convex Combination of Adaptive Filters under the Maximum Correntropy Criterion in Impulsive Interference. Signal Process. Lett. IEEE 2014, 21, 1385–1388. [Google Scholar] [CrossRef]
Zhao, S.; Chen, B.; Principe, J.C. Kernel adaptive filtering with maximum correntropy criterion. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 2012–2017. [Google Scholar]
Wu, Z.; Peng, S.; Chen, B.; Zhao, H. Robust Hammerstein Adaptive Filtering under Maximum Correntropy Criterion. Entropy 2015, 17, 7149–7166. [Google Scholar] [CrossRef]
Chen, B.; Wang, J.; Zhao, H.; Zheng, N.; Principe, J.C. Convergence of a fixed-point algorithm under Maximum Correntropy Criterion. Signal Process. Lett. IEEE 2015, 22, 1723–1727. [Google Scholar] [CrossRef]
Chen, B.; Xing, L.; Liang, J.; Zheng, N.; Principe, J.C. Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion. Signal Process. Lett. IEEE 2014, 21, 880–884. [Google Scholar]
Chen, L.; Qu, H.; Zhao, J.; Chen, B.; Principe, J.C. Efficient and robust deep learning with Correntropyinduced loss function. Neural Comput. Appl. 2015, 27, 1019–1031. [Google Scholar] [CrossRef]
Singh, A.; Principe, J.C. A loss function for classification based on a robust similarity metric. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–6. [Google Scholar]
Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 2015, 16, 993–1034. [Google Scholar]
Chen, B.; Príncipe, J.C. Maximum correntropy estimation is a smoothed MAP estimation. Signal Process. Lett. IEEE 2012, 19, 491–494. [Google Scholar] [CrossRef]
Nayyeri, M.; Yazdi, H.S.; Maskooki, A.; Rouhani, M. Universal Approximation by Using the Correntropy Objective Function. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4515–4521. [Google Scholar] [CrossRef]
Athreya, K.B.; Lahiri, S.N. Measure Theory and Probability Theory; Springer Science & Business Media: New York, NY, USA, 2006. [Google Scholar]
Fournier, N.; Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 2015, 162, 707–738. [Google Scholar] [CrossRef]
Leshno, M.; Lin, V.Y.; Pinkus, A.; Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993, 6, 861–867. [Google Scholar] [CrossRef]
Yuan, X.-T.; Hu, B.-G. Robust feature extraction via information theoretic learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1193–1200. [Google Scholar]
Klenke, A. Probability Theory: A Comprehensive Course; Springer Science & Business Media: New York, NY, USA, 2013. [Google Scholar]
Rudin, W. Principles of Mathematical Analysis; McGraw-Hill: New York, NY, USA, 1964; Volume 3. [Google Scholar]
Yang, X.; Tan, L.; He, L. A robust least squares support vector machine for regression and classification with noise. Neurocomputing 2014, 140, 41–52. [Google Scholar] [CrossRef]
Newman, D.; Hettich, S.; Blake, C.; Merz, C.; Aha, D. UCI Repository of Machine Learning Databases; Department of Information and Computer Science, University of California: Irvine, CA, USA, 1998; Available online: https://archive.ics.uci.edu/ (accessed on 29 November 2023).
Meyer, M.; Vlachos, P. Statlib. 1989. Available online: https://lib.stat.cmu.edu/datasets/ (accessed on 29 November 2023).
Pokharel, P.P.; Liu, W.; Principe, J.C. A low complexity robust detector in impulsive noise. Signal Process. 2009, 89, 1902–1909. [Google Scholar] [CrossRef]
Feng, Y.; Fan, J.; Suykens, J.A. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020, 21, 1–35. [Google Scholar]
Feng, Y. New Insights into Learning with Correntropy-Based Regression. Neural Comput. 2021, 33, 157–173. [Google Scholar] [CrossRef]
Ramirez-Parietti, I.; Contreras-Reyes, J.E.; Idrovo-Aguirre, B.J. Cross-sample entropy estimation for time series analysis: A nonparametric approach. Nonlinear Dyn. 2021, 105, 2485–2508. [Google Scholar] [CrossRef]
Bagirov, A.; Karmitsa, N.; Mäkelä, M.M. Introduction to Nonsmooth Optimization: Theory, Practice and Software; Springer International Publishing: Cham, Switzerland; Heidelberg, Germany, 2014; Volume 12. [Google Scholar]

Figure 1. SLFN with additive nodes.

Figure 2. Constructive network in which the last added node is referred to by L.

Figure 3. Tangent hyperbolic function.

Figure 4. Sigmoid function.

Figure 5. Effect of

λ^{'}

on accuracy. Parameter

C^{'}

is set to

0.45

. The experiment is performed on the diabetes dataset with the network with only one hidden node.

Figure 6. Convergence of the proposed method in the approximation of Sinc function when

σ = 10

. It converges to

V^{max} = 0.0399

.

Figure 7. Effect of hyperparameters

(C^{'}, λ^{'})

on accuracy.

Figure 8. Comparison of C2N2 with RBF and the network

N_{1}

. The experiment is performed on the approximation of the Sinc function.

Figure 9. Comparison of C2N2 with RBF and the network

N_{1}

. The experiment is performed on the approximation of the Sinc function and in the presence of impulsive noise.

Table 1. Specification of the regression problem.

Datasets	#Train	#Test	#Features
Baskball	64	32	4
Strike	416	209	6
Bodyfat	168	84	14
Quake	1452	726	3
Autoprice	106	53	9
Baloon	1334	667	2
Pyrim	49	25	27
Housing	337	169	13
Abalone	836	3341	8
Cleveland	149	148	13
Cloud	54	54	7

Table 2. Specification of the classification problem.

Dataset	#Train	#Test	#Features
Ionosphere	175	176	34
Australian Credit	460	230	6
Diabetes	512	256	8
Colon	32	30	2000
Liver	230	115	6
Leukemia	36	36	7129
Dimdata	1000	3192	14

Table 3. Performance comparison of C2N2 and the networks

N_{4}, N_{5}, N_{6}

, CCN and RBF: benchmark regression dataset.

Table 3. Performance comparison of C2N2 and the networks

N_{4}, N_{5}, N_{6}

, CCN and RBF: benchmark regression dataset.

Datasets	C2N2			$N_{4}$			$N_{5}$			$N_{6}$			CCN			RBF
Datasets	Testing RMSE	#N	Time (s)	Testing RMSE	$# N$	Time (s)	Testing RMSE	$# N$	Time (s)	Testing RMSE	$# N$	Time (s)	Testing RMSE	$# N$	Time (s)	Testing RMSE	$# N$	Time (s)
Autoprice	$\underset{̲}{0.2689}$	4.8	0.54	0.2996	2	1.49	$\underset{̲}{0.2689}$	9.2	3.99	0.3681	5	6.99	0.2758	$\underset{̲}{1.5}$	1.67	0.2725	79	0.008
Autoprice (Noise)	$\underset{̲}{0.2770}$	6.7	0.70	0.5521	$1.33$	1.07	0.3610	2.60	0.43	0.4295	3.77	1.41	0.4768	$\underset{̲}{1.2}$	1.89	0.9082	79	0.009
Baloon	0.1065	5.9	10.73	0.1163	5.75	5.21	0.1066	10	4.96	0.1257	8.3	9.97	0.1317	$\underset{̲}{3.58}$	2.09	$\underset{̲}{0.0563}$	150	0.086
Baloon (Noise)	$0.1056$	5.1	3.25	0.1166	$\underset{̲}{2}$	1.32	0.1252	8.9	4.66	0.1281	5	3.31	0.1358	2.9	2.12	$\underset{̲}{0.1051}$	1501	0.1351
Pyrim	$\underset{̲}{0.0482}$	6.7	0.23	0.0843	$\underset{̲}{1}$	0.27	0.2062	$\underset{̲}{1.2}$	0.07	0.1696	$\underset{̲}{1}$	0.17	0.1694	$\underset{̲}{1}$	0.17	0.0842	37	0.0090
Pyrim (noise)	$\underset{̲}{0.0521}$	4.9	0.16	0.6666	$\underset{̲}{1.5}$	0.32	0.4150	$\underset{̲}{1}$	0.036	0.6203	$\underset{̲}{1}$	0.62	0.5712	$\underset{̲}{1}$	0.62	1.5034	37	0.0043

Table 4. Performance comparison of RBF,

N_{4}, N_{5}, N_{6}

, CCN and C2N2: classification datasets.

Table 4. Performance comparison of RBF,

N_{4}, N_{5}, N_{6}

, CCN and C2N2: classification datasets.

Datasets	$N_{4}$			$N_{5}$			$N_{6}$			RBF			CCN			C2N2
Datasets	Testing Rate (%)	$# N$	Time (s)	Testing Rate (%)	$# N$	Time (s)	Testing Rate (%)	$# N$	Time (s)	Testing Rate (%)	$# N$	Time (s)	Testing Rate (%)	$# N$	Time (s)	Testing Rate (%)	$# N$	Time (s)
Ionospher	70.97	$1.10$	0.12	65.45	$\underset{̲}{1}$	0.11	78.98	2.40	0.25	82.61	175	0.14	78.04	$\underset{̲}{1.60}$	0.21	$\underset{̲}{86.70}$	1.30	0.22
Colon	62.00	$\underset{̲}{1}$	0.02	63.00	$1.15$	0.03	62.33	$1.35$	0.38	90.50	32	0.12	64.04	$\underset{̲}{1.05}$	0.29	$\underset{̲}{91.50}$	$\underset{̲}{1}$	0.15
Leukemia	64.44	$\underset{̲}{1}$	0.03	64.44	$\underset{̲}{1}$	0.05	72.44	$\underset{̲}{1}$	0.04	88.61	36	0.22	83.71	$\underset{̲}{1.3}$	0.31	94.72	1	0.20
Dimdata	89.74	7.05	15.31	88.44	6.60	9.39	88.77	4.30	7.34	$\underset{̲}{95.05}$	1000	4.36	88.37	$\underset{̲}{3.60}$	8.98	93.73	3.50	8.29

Table 5. Performance comparison of C2N2 and the state-of-the-art constructive networks OLSCN and OHLCN: benchmark regression dataset.

Dataset	C2N2			OLSCN			OHLCN
Dataset	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)
Housing	$\underset{̲}{0.0966}$	5.6	0.79	0.0988	$\underset{̲}{1.9}$	1.44	0.0993	2.8	0.38
Housing (Noise)	$\underset{̲}{0.0978}$	5.87	1.07	0.2411	$\underset{̲}{1.1}$	1.53	0.1824	4.3	0.09
Strike	$\underset{̲}{0.2807}$	2.8	1.11	$0.2818$	$\underset{̲}{2}$	2.77	0.2888	3.3	0.62
Strike (Noise)	$\underset{̲}{0.2817}$	4.4	0.87	0.3017	$\underset{̲}{2}$	2.21	0.3912	6.0	0.07
Quake	$\underset{̲}{0.1784}$	6.6	2.05	0.1821	$\underset{̲}{2}$	3.88	0.1815	6	0.023
Quake (noise)	$\underset{̲}{0.1744}$	$2$	0.42	0.1870	$\underset{̲}{1.75}$	2.21	0.1849	4	0.021

Table 6. Performance comparison of C2N2 AND the state-of-the-art constructive networks OLSCN and OHLCN: benchmark classification dataset.

Dataset	C2N2			OLSCN			OHLCN
Dataset	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)
Australian (Noise)	$\underset{̲}{86.77}$	3	0.48	$86.67$	2	10.88	$86.09$	$\underset{̲}{1}$	1.01
Liver (Noise)	$\underset{̲}{65.70}$	4	0.25	$65.12$	$\underset{̲}{2}$	1.01	53.49	9	0.02
Diabete (noise)	$\underset{̲}{79.95}$	$\underset{̲}{1}$	1.11	78.65	2	3.36	40.56	$\underset{̲}{1}$	6.23

Table 8. Amount of auxiliary variables for noisy and noise-free data. The experiment is performed on the

f^{1}

dataset.

Table 8. Amount of auxiliary variables for noisy and noise-free data. The experiment is performed on the

f^{1}

dataset.

Noisy Data	$# 3$	$# 10$	$# 36$	$# 55$	$# 100$
$(α_{i}, γ_{i})$	$(- 0.2351, - 3.29 * 10^{- 22})$	$- 0.2551, - 4.3 * 10^{- 22})$	$(- 0.2310, - 1.04 * 10^{- 24})$	$(- 0.2607, - 8.06 * 10^{- 22})$	$(- 0.2482, - 6.12 * 10^{- 23})$
Noise-free data	$# 4$	$# 11$	$# 37$	$# 56$	$# 101$
$α_{i}, γ_{i}$	$(0.3837, - 0.3989)$	$(- 0.3983, - 0.3989)$	$(- 0.3989, - 0.3955)$	$(- 0.3989, - 0.3982)$	$(- 0.3988, - 0.3988)$

Table 9. Performance comparison of C2N2 and the state-of-the-art robust methods MLPMEE, MLPMCC and RLS-LSVM: benchmark regression dataset.

Dataset	C2N2			MLPMEE			MLPMCC			RLS-LSVM
Dataset	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)	Testing RMSE	#Nodes	Time (s)
Bodyfat	$\underset{̲}{0.00281}$	$\underset{̲}{5}$	0.4623	0.0045	10	-	$\underset{̲}{0.00291}$	40	-	$\underset{̲}{0.00295}$	101	$\underset{̲}{0.0789}$
Bodyfat (Noise)	$\underset{̲}{0.00241}$	$\underset{̲}{4}$	$\underset{̲}{0.2751}$	$0.00251$	10	-	$\underset{̲}{0.00257}$	10	-	0.00451	101	9.651
Pyrim	$\underset{̲}{0.0488}$	$\underset{̲}{7}$	0.2712	0.0798	20	-	0.0882	40	-	0.0817	37.3	$\underset{̲}{0.0352}$
Pyrim (Noise)	$\underset{̲}{0.05213}$	$\underset{̲}{4.4}$	$\underset{̲}{0.1521}$	$0.0586$	10	-	0.12034	30	-	0.12345	37	4.495
Baskball	0.12293	$\underset{̲}{5.5}$	$\underset{̲}{0.2454}$	0.1352	30	-	0.13114	20	-	$\underset{̲}{0.1143}$	48	0.3687
Baskball (noise)	$\underset{̲}{0.11981}$	$\underset{̲}{1.33}$	$\underset{̲}{0.0238}$	0.14352	20	-	0.1328	20	-	0.12839	48	22.569

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Correntropy-Based Constructive One Hidden Layer Neural Network

Abstract

1. Introduction

2. Mathematical Notations, Definitions and Preliminaries

2.1. Measure Space, Probability Space and Function Space

2.2. Network Structure

3. Previous Work

3.1. The Networks Introduced in [3]

3.2. Cascade Correntropy Network (CCOEN) [28]

4. Proposed Method

4.1. Preliminaries for Presenting the Proposed Method

4.2. C2N2: Objective Function for Training the New Node

4.3. Convergence Analysis

4.4. Learning from Data Samples

4.4.1. Input Side Optimization

4.4.2. Output Side Optimization

5. Experimental Results

5.1. Framework for Experiments

5.1.1. Activation Function and Kernel

5.1.2. Hyperparameters

5.1.3. Data Normalization

5.2. Convergence

5.2.1. Investigation of Theorem 6

5.2.2. Hyperparameter Evaluation

5.3. Comparison

5.3.1. Synthetic Dataset (Sinc Function)

5.3.2. Other Synthetic Dataset

5.4. Discussion

5.4.1. Discussion on Table 7

5.4.2. Why C2N2 Denies Impulse Noises?

5.4.3. Benchmark Dataset

5.4.4. Discussion on Table 3

5.4.5. Discussion on Table 4

5.4.6. Discussion on Table 5 and Table 6

5.4.7. Computational Complexity

5.5. Comparison

5.5.1. Discussion on Table 9

5.5.2. Discussion on Table 10

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A

Appendix A.1. Proof of Lemma 1

Appendix A.2. Proof of Theorem 6

References

Article Metrics

Citations

Article Access Statistics