Article

Robust Twin Extreme Learning Machine Based on Soft Truncated Capped L1-Norm Loss Function

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4533; https://doi.org/10.3390/electronics13224533
Submission received: 13 October 2024 / Revised: 11 November 2024 / Accepted: 13 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Advanced Control Strategies and Applications of Multi-Agent Systems)

Abstract

Currently, most researchers propose robust algorithms from different perspectives to overcome the impact of outliers on a model, for example by introducing robust loss functions. However, many loss functions fail to achieve satisfactory results when the outliers are large, so capped losses have become a popular choice. The majority of researchers directly set an upper bound on the loss function, which reduces the impact of large outliers but also introduces non-differentiable regions. To avoid this shortcoming, we propose a robust twin extreme learning machine based on a soft capped L1-norm loss function (SCTELM). It uses a soft capped L1-norm loss function, which not only overcomes the shortcomings of the hard capped loss function but also improves the robustness of the model. Simultaneously, to improve the learning efficiency of the model, the stochastic variance-reduced gradient (SVRG) optimization algorithm is used. Experimental results on several datasets show that the proposed algorithm can compete with state-of-the-art algorithms in terms of robustness.

1. Introduction

The extreme learning machine (ELM) [1,2] was proposed by Huang et al. for single-hidden-layer feedforward networks. Compared with traditional training methods, ELM has advantages in learning speed and generalization performance. This is because, unlike traditional methods, the input weights and the biases of the hidden-layer nodes are generated randomly, allowing the output weights to be determined quickly and effectively. Additionally, the ultimate aim of ELM is to reduce both the training error and the norm of the output weights, which enhances its generalization performance in feedforward neural networks. Nowadays, ELM is widely used in ship detection [3], online visual tracking [4], image quality assessment [5], and other fields.
In recent years, many scholars have proposed different algorithms based on ELM and achieved good performance. However, most ELM-based algorithms are sensitive to outliers, which degrades model performance, and robust learning has therefore become a new direction in machine learning. To overcome the influence of outliers on ELM, researchers have proposed robust ELM-based algorithms from different perspectives [6,7,8,9,10,11,12]. From the perspective of the loss function, and based on the expected penalty and correntropy, Wu et al. [7] proposed an asymmetric non-convex bounded loss function (AC-loss) and obtained the L1-norm robust regularized extreme learning machine with asymmetric AC-loss (L1-ACELM). Ren and Yang [11] proposed a new hybrid loss function combining the pinball loss and the least squares loss and built a robust ELM framework on it. Luo et al. [12] proposed replacing the L2 loss with the correntropy loss induced by the p-order Laplace kernel to reduce the influence of outliers during learning.
The L1-norm, also called the sparse rule operator, has received increasing attention and been applied in more and more fields, and it is considered robust to outliers [13]. Zhang and Luo [14] took advantage of this robustness and introduced an L1 loss function into ELM to deal with outliers. Although the L1-norm works well for improving robustness, most existing L1-norm-based algorithms may not obtain satisfactory results when the outliers are large. Therefore, more and more scholars have turned to the capped L1-norm and achieved good results [15,16,17,18,19]. A brief review of some representative works follows. Wang et al. [16] proposed a new capped L1-norm twin SVM. Chen et al. [17] proposed a capped L1-norm sparse representation method (CSR) to remove outliers and ensure the quality of the graph. Li et al. [18] proposed a robust capped L1-norm twin support vector machine (R-CTSVM+), which adopts the capped L1 regularization distance to ensure the robustness of the model.
Inspired by SVM and GEPSVM, Jayadeva et al. [20] proposed the twin SVM (TSVM, also written TWSVM). Compared with SVM, TWSVM greatly improves computational complexity and speed, because it replaces one large quadratic programming problem (QPP) with a pair of smaller QPPs and is theoretically about four times faster than the traditional SVM [20,21]. Owing to this excellent performance, scholars have proposed many variants of TSVM. For example, Kumar et al. proposed the least squares TSVM (LSTSVM) [22] and Shao et al. proposed the twin bounded SVM (TBSVM) [23]. LSTSVM simplifies the pair of QPPs in TSVM to solving two systems of linear equations, which greatly improves speed but loses sparsity, while TBSVM introduces a regularization term so that structural risk minimization is taken into account. These improvements further strengthen the recognition ability and adaptability of TSVM. Inspired by TWSVM, Wan et al. [24] proposed the twin extreme learning machine (TELM), which combines TWSVM with ELM and therefore inherits the advantages of both; at the same time, TELM has a faster learning speed than ELM.
Most researchers use a hard capped strategy to enhance robustness by reducing the impact of noisy data, but it has inherent shortcomings, such as inadvertently ignoring key data and introducing non-differentiable regions. To address this, this paper proposes a soft capped loss function based on the L1-norm (SC-loss) and introduces it into TELM to construct the SCTELM model. The SC-loss reaches its upper bound more smoothly than a hard capped loss and can be applied to various machine learning problems, effectively reducing the impact of large outliers on the model. We also employ the stochastic variance-reduced gradient (SVRG) algorithm to further improve learning efficiency. Extensive experimental results show that SCTELM is competitive in terms of both accuracy and learning efficiency.
The main contributions of this paper are as follows:
  • This paper proposes an efficient and reliable learning framework based on TELM, namely a robust twin extreme learning machine with an L1-norm loss function based on the soft capped strategy (SCTELM).
  • This paper proposes a soft capped L1-norm loss, which reaches its upper bound more smoothly and thus reduces the impact of large outliers on the model.
  • Experimental results on various datasets show that our proposed SCTELM is competitive with other algorithms in terms of accuracy and learning efficiency.
The rest of this paper is organized as follows. Section 2 briefly reviews ELM, TWSVM and TELM. Section 3 describes the proposed SC-loss and SCTELM in detail and gives the related theoretical analysis. Experimental results on multiple datasets are presented in Section 4. Section 5 concludes the paper.

2. Related Work

In this section, we give a brief overview of the extreme learning machine (ELM), the twin support vector machine (TWSVM) and the twin extreme learning machine (TELM), along with an introduction to the notation used in this paper.

2.1. Notations

In this paper, uppercase letters denote matrices and lowercase letters denote vectors; all vectors are column vectors. The symbol $(\cdot)^T$ denotes the transpose, $e$ represents an all-one column vector of appropriate dimension, and $I$ is the identity matrix.
The model discussed in this article is a binary classification model in $\mathbb{R}^n$. The sample data comprise $l_1$ positive samples labeled $+1$ and $l_2$ negative samples labeled $-1$, with $l_1 + l_2 = l$. The matrices $A \in \mathbb{R}^{l_1 \times n}$ and $B \in \mathbb{R}^{l_2 \times n}$ collect the positive and negative samples, respectively.

2.2. ELM

The extreme learning machine (ELM) is a generalized single-hidden-layer feedforward network (SLFN) proposed by Huang et al. It randomly initializes the connection weights between the input layer and the hidden layer together with the hidden-layer biases, and only the output weights are learned. The output function of ELM is
$$ f(x) = h(x)\beta $$
The ELM architecture is shown in Figure 1, where $\beta$ is the weight vector between the hidden layer and the output layer, and $h(x) = (h_1(x), h_2(x), \ldots, h_L(x)) \in \mathbb{R}^{1 \times L}$ is the random mapping of the hidden layer for the input pattern $x = (x_1, x_2, \ldots, x_n)^T$. Here, $L$ is the number of hidden nodes of ELM and $h_j(x) = g\left(\sum_{i=1}^{n} x_i \omega_{ij} + b_j\right)$, where $g$ is the activation function, for example the sigmoid $g(z) = \frac{e^z}{e^z + 1}$.
To obtain the optimal $\beta$, the training error must be minimized. The optimal solution is obtained by minimizing both the training error $\|H\beta - T\|$ and the norm of the output weights $\|\beta\|$:
$$ \min \|H\beta - T\| \quad \text{and} \quad \min \|\beta\| $$
where $H$ is the output matrix of the hidden layer and $T$ is the sample label matrix, with $H = (h(x_1)^T, h(x_2)^T, \ldots, h(x_n)^T)^T$ and $T = (t_1, t_2, \ldots, t_n)^T$.
Thus, the optimal solution of Equation (2) can be deduced as
$$ \beta = H^{\dagger} T $$
where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix $H$.
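As a concrete illustration of the closed-form training above, the following minimal Python/NumPy sketch (not the authors' MATLAB implementation; the sigmoid activation, node count and toy data are assumptions made for the example) builds the hidden-layer output matrix H from random input weights and biases and obtains β via the Moore–Penrose pseudoinverse.

```python
import numpy as np

def elm_train(X, T, L=50, seed=0):
    """Minimal ELM: random input weights and biases, closed-form output weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.standard_normal((n, L))            # random input-to-hidden weights (kept fixed)
    b = rng.standard_normal(L)                 # random hidden biases (kept fixed)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output matrix, shape (samples, L)
    beta = np.linalg.pinv(H) @ T               # beta = H^dagger T (Moore-Penrose pseudoinverse)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                            # f(x) = h(x) beta

# Toy usage on two Gaussian blobs labelled +1 / -1 (illustrative data, not the paper's)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, (100, 2)), rng.normal(-1.0, 1.0, (100, 2))])
T = np.hstack([np.ones(100), -np.ones(100)])
W, b, beta = elm_train(X, T)
print("training accuracy:", np.mean(np.sign(elm_predict(X, W, b, beta)) == T))
```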

2.3. TWSVM

The TWSVM algorithm seeks two non-parallel hyperplanes, one for each class, such that each hyperplane is as close as possible to the sample points of its own class and as far as possible from the sample points of the other class. A new point is then assigned to the class whose hyperplane it lies closer to. The two hyperplanes are written, respectively, as follows:
$$ f_1(x) = \omega_1^T x + b_1 = 0 \quad \text{and} \quad f_2(x) = \omega_2^T x + b_2 = 0 $$
TWSVM solves two smaller QPPs instead of one large QPP, so it is faster than the traditional SVM in computing time. The TWSVM classifier is obtained by solving the following two quadratic programming problems:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^T(A\omega_1 + e_1 b_1) + C_1 e_2^T \xi, \quad \text{s.t.}\ -(B\omega_1 + e_2 b_1) + \xi \ge e_2,\ \xi \ge 0. $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^T(B\omega_2 + e_2 b_2) + C_2 e_1^T \xi, \quad \text{s.t.}\ (A\omega_2 + e_1 b_2) + \xi \ge e_1,\ \xi \ge 0. $$
where $C_1 \ge 0$ and $C_2 \ge 0$ are regularization parameters and $\xi$ is a slack vector. Using the Lagrange multipliers $\alpha \ge 0$ and $\beta \ge 0$ and the Karush–Kuhn–Tucker (KKT) conditions, the duals of the QPPs in (5) and (6) are as follows:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^T G (H^T H)^{-1} G^T \alpha - e_2^T \alpha, \quad \text{s.t.}\ 0 \le \alpha \le C_1 e_2. $$
$$ \min_{\beta}\ \frac{1}{2}\beta^T H (G^T G)^{-1} H^T \beta - e_1^T \beta, \quad \text{s.t.}\ 0 \le \beta \le C_2 e_1. $$
where $H = [A, e_1]$ and $G = [B, e_2]$. After solving (7) and (8), the proximal hyperplanes are given as follows:
$$ [w_1, b_1]^T = -(H^T H + \epsilon I)^{-1} G^T \alpha, \qquad [w_2, b_2]^T = (G^T G + \epsilon I)^{-1} H^T \beta. $$
where $\epsilon I$ is a regularization term with $\epsilon > 0$. After obtaining $(w_1, b_1)$ and $(w_2, b_2)$, we can classify a new sample point $x$ according to the following decision function:
$$ y = \arg\min\left\{ \frac{|f_1(x)|}{\|\omega_1\|},\ \frac{|f_2(x)|}{\|\omega_2\|} \right\} $$

2.4. TELM

TELM follows the idea of TWSVM and also determines two non-parallel hyperplanes, here in the ELM feature space:
$$ f_1(x) = h(x)\beta_1 = 0 \quad \text{and} \quad f_2(x) = h(x)\beta_2 = 0 $$
Similar to TWSVM, the primal problem of TELM can be expressed as
$$ \min_{\beta_1, \xi}\ \frac{1}{2}\|U\beta_1\|_2^2 + C_1 e_2^T \xi, \quad \text{s.t.}\ -V\beta_1 + \xi \ge e_2,\ \xi \ge 0. $$
$$ \min_{\beta_2, \eta}\ \frac{1}{2}\|V\beta_2\|_2^2 + C_2 e_1^T \eta, \quad \text{s.t.}\ U\beta_2 + \eta \ge e_1,\ \eta \ge 0. $$
where $C_1, C_2 > 0$ are trade-off parameters; $\xi$ and $\eta$ are non-negative slack vectors; and $U = (h(x_1)^T, h(x_2)^T, \ldots, h(x_{l_1})^T)^T \in \mathbb{R}^{l_1 \times L}$ and $V = (h(x_1)^T, h(x_2)^T, \ldots, h(x_{l_2})^T)^T \in \mathbb{R}^{l_2 \times L}$ are the hidden-layer output matrices of the positive and negative classes, respectively. Based on the Karush–Kuhn–Tucker (KKT) conditions, training such a TELM is equivalent to solving the following dual optimization problems:
$$ \max_{\alpha}\ e_2^T \alpha - \frac{1}{2}\alpha^T V (U^T U)^{-1} V^T \alpha, \quad \text{s.t.}\ 0 \le \alpha \le C_1 e_2, $$
$$ \max_{\gamma}\ e_1^T \gamma - \frac{1}{2}\gamma^T U (V^T V)^{-1} U^T \gamma, \quad \text{s.t.}\ 0 \le \gamma \le C_2 e_1. $$
where $\alpha$ and $\gamma$ are the vectors of Lagrange multipliers. The output weights can then be obtained as $\beta_1 = -(U^T U + \epsilon I)^{-1} V^T \alpha$ and $\beta_2 = (V^T V + \epsilon I)^{-1} U^T \gamma$. When a new sample vector $x$ arrives, its class can be determined according to the following decision function:
$$ f(x) = \arg\min\{ |h(x)\beta_1|,\ |h(x)\beta_2| \} $$
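To make the decision rule concrete, a minimal NumPy sketch is given below; it assumes the dual vectors α and γ have already been obtained from the two QPPs (solving them is left to a QP solver), and the sign convention on β1 is an assumption of the sketch rather than a statement taken from the paper.

```python
import numpy as np

def telm_planes(U, V, alpha, gamma, eps=1e-3):
    """Recover the two TELM output-weight vectors from given dual solutions.

    U, V  : hidden-layer output matrices of the positive / negative class.
    alpha : dual vector of the first QPP; gamma: dual vector of the second QPP
            (assumed to be already computed by a QP solver; not computed here).
    """
    L = U.shape[1]
    beta1 = -np.linalg.solve(U.T @ U + eps * np.eye(L), V.T @ alpha)
    beta2 = np.linalg.solve(V.T @ V + eps * np.eye(L), U.T @ gamma)
    return beta1, beta2

def telm_decide(h_x, beta1, beta2):
    """Assign class +1 if x lies closer to the first hyperplane, otherwise -1."""
    return 1 if abs(h_x @ beta1) <= abs(h_x @ beta2) else -1
```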

3. Main Contributions

In this section, we introduce a capped L 1 -norm loss into TELM, thereby establishing the proposed SCTELM in this paper. We optimize it using the SVRG algorithm and further conduct a convergence analysis of the model to investigate its stability.

3.1. Capped SC-Loss Function

To alleviate the non-differentiable region introduced by the hard capped strategy, we propose the soft truncated SC-loss function $L_\theta$ as follows:
Definition 1.
Given a vector $u$, the $L_\theta$ function is defined as
$$ L_\theta(u) = 1 - \frac{1}{1 + \theta|u|} $$
where $\theta \ge 0$ is the adaptive parameter of the loss function. As shown in Figure 2, the capped L1-norm loss (with $\epsilon = 1$) is non-differentiable at $|u| = 1$, whereas the proposed $L_\theta$-loss avoids this problem and is therefore easier to handle.
The following are some properties of the $L_\theta$-loss:
Theorem 1.
$L_\theta$ is a symmetric function.
Proof of Theorem 1.
$$ L_\theta(-u) = 1 - \frac{1}{1 + \theta|-u|} = 1 - \frac{1}{1 + \theta|u|} = L_\theta(u) $$
   □
Theorem 2.
$L_\theta$ is a non-negative function.
Proof of Theorem 2.
$$ L_\theta(u) = 1 - \frac{1}{1 + \theta|u|} = \frac{\theta|u|}{1 + \theta|u|} \ge 0 $$
   □
Theorem 3.
$L_\theta$ is a bounded function.
Proof of Theorem 3.
$$ \lim_{u \to \infty} L_\theta(u) = \lim_{u \to \infty}\left(1 - \frac{1}{1 + \theta|u|}\right) = 1 $$
   □
Remark 1.
Here, 1 is the supremum of $L_\theta(u)$; the loss of any sample is capped by this value, which is what makes the loss function robust to large outliers.
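The properties above are easy to check numerically. The short sketch below (parameter values are illustrative) implements L_θ next to the hard capped L1-norm loss and verifies symmetry, non-negativity and boundedness.

```python
import numpy as np

def sc_loss(u, theta=3.0):
    """Soft capped L1-norm loss: L_theta(u) = 1 - 1 / (1 + theta * |u|)."""
    return 1.0 - 1.0 / (1.0 + theta * np.abs(u))

def hard_capped_l1(u, eps=1.0):
    """Hard capped L1-norm loss min(|u|, eps); non-differentiable at |u| = eps."""
    return np.minimum(np.abs(u), eps)

u = np.linspace(-5.0, 5.0, 11)
print(np.allclose(sc_loss(u), sc_loss(-u)))   # symmetry (Theorem 1)
print(np.all(sc_loss(u) >= 0.0))              # non-negativity (Theorem 2)
print(sc_loss(1e9))                           # approaches the bound 1 for large |u| (Theorem 3)
```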

3.2. Linear SCTELM

Based on (11), (12) and the proposed $L_\theta$-loss function, the SCTELM formulation can be expressed as follows:
SCTELM1:
$$ \min_{\beta_1}\ \frac{1}{2}\|U\beta_1\|^2 + C_1 \sum_{i=1}^{l_2}\left(1 - \frac{1}{1 + \theta\,|1 + h(x_i)\beta_1|}\right) $$
SCTELM2:
$$ \min_{\beta_2}\ \frac{1}{2}\|V\beta_2\|^2 + C_2 \sum_{j=1}^{l_1}\left(1 - \frac{1}{1 + \theta\,|1 - h(x_j)\beta_2|}\right) $$
For SCTELM1, substituting $\beta_1 = -(U^T U + \epsilon_1 I)^{-1} V^T \alpha$ into $\|U\beta_1\|^2$ gives
$$ \|U\beta_1\|^2 = (U\beta_1)^T(U\beta_1) = \left[U(U^T U + \epsilon_1 I)^{-1} V^T \alpha\right]^T \left[U(U^T U + \epsilon_1 I)^{-1} V^T \alpha\right] = \alpha^T V (U^T U + \epsilon_1 I)^{-1} U^T U (U^T U + \epsilon_1 I)^{-1} V^T \alpha $$
Let $D_1 = (U^T U + \epsilon_1 I)^{-1} U^T U (U^T U + \epsilon_1 I)^{-1}$; then
$$ \|U\beta_1\|^2 = (U\beta_1)^T(U\beta_1) = \alpha^T V D_1 V^T \alpha = \sum_{i=1}^{l_2} \alpha_i h(x_i) D_1 \sum_{k=1}^{l_2} h(x_k)^T \alpha_k = \sum_{i=1}^{l_2} \sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T $$
Substituting $\beta_1 = -(U^T U + \epsilon_1 I)^{-1} V^T \alpha$ into $h(x_i)\beta_1$ gives
$$ h(x_i)\beta_1 = -h(x_i)(U^T U + \epsilon_1 I)^{-1} V^T \alpha $$
Let $E_1 = (U^T U + \epsilon_1 I)^{-1}$; then
$$ h(x_i)\beta_1 = -h(x_i) E_1 V^T \alpha = -h(x_i) E_1 \sum_{k=1}^{l_2} h(x_k)^T \alpha_k = -\sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T $$
Thus, (20) can be expressed as
$$ \min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l_2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1 \sum_{i=1}^{l_2}\left(1 - \frac{1}{1 + \theta\left|1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T\right|}\right) $$
For a given sample $x_i$, its objective function is denoted by
$$ f_i(\alpha) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1\left(1 - \frac{1}{1 + \theta\left|1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T\right|}\right) $$
If $1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T \le 0$, we have
$$ f_i(\alpha) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1\left(1 - \frac{1}{1 + \theta\left(\sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T - 1\right)}\right) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1 \frac{\theta\left(\sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T - 1\right)}{1 + \theta\left(\sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T - 1\right)} $$
Let $\xi_i(\alpha) = 1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T$; then
$$ f_i(\alpha) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T - C_1 \frac{\theta \xi_i(\alpha)}{1 - \theta \xi_i(\alpha)} $$
and the gradient of $f_i(\alpha)$ with respect to $\alpha$ can be represented by
$$ \nabla f_i(\alpha) = \frac{\partial f_i(\alpha)}{\partial \alpha} = \frac{V D_1 V^T \alpha}{l_2} + \frac{C_1 \theta\, V E_1 h(x_i)^T}{\left(1 - \theta \xi_i(\alpha)\right)^2} $$
If $1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T > 0$, we then have
$$ f_i(\alpha) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1\left(1 - \frac{1}{1 + \theta\left(1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_1 h(x_k)^T\right)}\right) = \frac{1}{2}\sum_{k=1}^{l_2} \alpha_i \alpha_k h(x_i) D_1 h(x_k)^T + C_1 \frac{\theta \xi_i(\alpha)}{1 + \theta \xi_i(\alpha)} $$
and the gradient of $f_i(\alpha)$ with respect to $\alpha$ can be represented by
$$ \nabla f_i(\alpha) = \frac{\partial f_i(\alpha)}{\partial \alpha} = \frac{V D_1 V^T \alpha}{l_2} - \frac{C_1 \theta\, V E_1 h(x_i)^T}{\left(1 + \theta \xi_i(\alpha)\right)^2} $$
Combining (30) and (32), the gradient of $f_i(\alpha)$ with respect to $\alpha$ can be represented by
$$ \nabla f_i(\alpha) = \frac{V D_1 V^T \alpha}{l_2} - \frac{C_1 \theta\, V E_1 h(x_i)^T\, \mathrm{sgn}(\xi_i(\alpha))}{\left(1 + \mathrm{sgn}(\xi_i(\alpha))\, \theta \xi_i(\alpha)\right)^2} $$
Then, the average gradient over the $l_2$ samples can be computed as
$$ R_{l_2} = \frac{1}{l_2}\sum_{i=1}^{l_2} \nabla f_i(\alpha) = \frac{V D_1 V^T \alpha}{l_2} - \frac{C_1 \theta\, V E_1}{l_2} \sum_{i=1}^{l_2} \frac{h(x_i)^T\, \mathrm{sgn}(\xi_i(\alpha))}{\left(1 + \mathrm{sgn}(\xi_i(\alpha))\, \theta \xi_i(\alpha)\right)^2} $$
Similarly, for SCTELM2 we obtain
$$ \nabla f_j(\gamma) = \frac{U D_2 U^T \gamma}{l_1} - \frac{C_2 \theta\, U E_2 h(x_j)^T\, \mathrm{sgn}(\eta_j(\gamma))}{\left(1 + \mathrm{sgn}(\eta_j(\gamma))\, \theta \eta_j(\gamma)\right)^2} $$
$$ R_{l_1} = \frac{U D_2 U^T \gamma}{l_1} - \frac{C_2 \theta\, U E_2}{l_1} \sum_{j=1}^{l_1} \frac{h(x_j)^T\, \mathrm{sgn}(\eta_j(\gamma))}{\left(1 + \mathrm{sgn}(\eta_j(\gamma))\, \theta \eta_j(\gamma)\right)^2} $$
where $D_2 = (V^T V + \epsilon_2 I)^{-1} V^T V (V^T V + \epsilon_2 I)^{-1}$, $E_2 = (V^T V + \epsilon_2 I)^{-1}$ and $\eta_j(\gamma) = 1 - \sum_{m=1}^{l_1} \gamma_m h(x_j) E_2 h(x_m)^T$. From the above equations, we can obtain $\alpha$ and $\gamma$; when a new sample vector $x$ arrives, we can make predictions according to the following decision function:
$$ f(x) = \arg\min\left\{ \left|h(x)(U^T U + \epsilon_1 I)^{-1} V^T \alpha\right|,\ \left|h(x)(V^T V + \epsilon_2 I)^{-1} U^T \gamma\right| \right\} $$
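A minimal NumPy sketch of the per-sample gradient for the first problem is given below; it follows the combined expression for ∇f_i(α) reconstructed above (the sign conventions and the random test matrices are assumptions of the sketch, not the authors' code, and the gradient in γ is obtained in the same way).

```python
import numpy as np

def precompute_first_problem(U, eps1):
    """D1 and E1 for the first SCTELM problem, as defined in the text."""
    L = U.shape[1]
    Minv = np.linalg.inv(U.T @ U + eps1 * np.eye(L))
    return Minv @ (U.T @ U) @ Minv, Minv       # D1, E1

def grad_f_i(alpha, i, V, D1, E1, C1, theta):
    """Per-sample gradient of f_i(alpha) for the first SCTELM problem."""
    l2 = V.shape[0]
    vE_hi = V @ (E1 @ V[i])                    # vector with entries h(x_i) E1 h(x_k)^T
    xi = 1.0 - vE_hi @ alpha                   # xi_i(alpha)
    s = np.sign(xi)
    quad = (V @ (D1 @ (V.T @ alpha))) / l2     # gradient of the quadratic term
    cap = C1 * theta * vE_hi * s / (1.0 + s * theta * xi) ** 2
    return quad - cap

# Tiny shape check with random hidden-layer outputs (illustrative only)
rng = np.random.default_rng(0)
U, V = rng.standard_normal((30, 10)), rng.standard_normal((25, 10))
D1, E1 = precompute_first_problem(U, eps1=0.1)
print(grad_f_i(np.zeros(25), 3, V, D1, E1, C1=1.0, theta=3.0).shape)   # (25,)
```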

3.3. Nonlinear SCTELM

Sometimes we encounter datasets that are not linearly separable. In such cases, we can extend SCTELM to the nonlinear case by considering the surfaces generated by the following two kernel functions:
$$ K_{ELM}(x^T, G^T)\beta_1 = 0, \qquad K_{ELM}(x^T, G^T)\beta_2 = 0, $$
where $G = (A; B)$ and $K_{ELM}$ is the ELM kernel function, which can be expressed as $K_{ELM}(x_i, x_j) = h(x_i)h(x_j)^T$. From this, the original problem of TELM can be expressed as follows:
$$ \min_{\beta_1, \xi}\ \frac{1}{2}\left\|K_{ELM}(A^T, G^T)\beta_1\right\|_2^2 + C_1 e_2^T \xi, \quad \text{s.t.}\ -K_{ELM}(B^T, G^T)\beta_1 + \xi \ge e_2,\ \xi \ge 0. $$
$$ \min_{\beta_2, \eta}\ \frac{1}{2}\left\|K_{ELM}(B^T, G^T)\beta_2\right\|_2^2 + C_2 e_1^T \eta, \quad \text{s.t.}\ K_{ELM}(A^T, G^T)\beta_2 + \eta \ge e_1,\ \eta \ge 0. $$
From the KKT conditions, we obtain $\beta_1 = -(R^T R + \epsilon_1 I)^{-1} S^T \alpha$ and $\beta_2 = (S^T S + \epsilon_2 I)^{-1} R^T \gamma$, where $R = K_{ELM}(A^T, G^T)$, $S = K_{ELM}(B^T, G^T)$, and $\alpha$ and $\gamma$ are Lagrange multipliers. The original problem of SCTELM then becomes the following:
SCTELM1:
$$ \min_{\beta_1}\ \frac{1}{2}\left\|K_{ELM}(A^T, G^T)\beta_1\right\|^2 + C_1 \sum_{i=1}^{l_2}\left(1 - \frac{1}{1 + \theta\,|1 + h(x_i)\beta_1|}\right) $$
SCTELM2:
$$ \min_{\beta_2}\ \frac{1}{2}\left\|K_{ELM}(B^T, G^T)\beta_2\right\|^2 + C_2 \sum_{j=1}^{l_1}\left(1 - \frac{1}{1 + \theta\,|1 - h(x_j)\beta_2|}\right) $$
Following the derivation for the linear SCTELM, we obtain, for SCTELM1,
$$ \nabla f_i(\alpha) = \frac{S D_3 S^T \alpha}{l_2} - \frac{C_1 \theta\, S E_3 h(x_i)^T\, \mathrm{sgn}(\xi_i(\alpha))}{\left(1 + \mathrm{sgn}(\xi_i(\alpha))\, \theta \xi_i(\alpha)\right)^2} $$
$$ R_{l_2} = \frac{S D_3 S^T \alpha}{l_2} - \frac{C_1 \theta\, S E_3}{l_2} \sum_{i=1}^{l_2} \frac{h(x_i)^T\, \mathrm{sgn}(\xi_i(\alpha))}{\left(1 + \mathrm{sgn}(\xi_i(\alpha))\, \theta \xi_i(\alpha)\right)^2} $$
and, for SCTELM2,
$$ \nabla f_j(\gamma) = \frac{R D_4 R^T \gamma}{l_1} - \frac{C_2 \theta\, R E_4 h(x_j)^T\, \mathrm{sgn}(\eta_j(\gamma))}{\left(1 + \mathrm{sgn}(\eta_j(\gamma))\, \theta \eta_j(\gamma)\right)^2} $$
$$ R_{l_1} = \frac{R D_4 R^T \gamma}{l_1} - \frac{C_2 \theta\, R E_4}{l_1} \sum_{j=1}^{l_1} \frac{h(x_j)^T\, \mathrm{sgn}(\eta_j(\gamma))}{\left(1 + \mathrm{sgn}(\eta_j(\gamma))\, \theta \eta_j(\gamma)\right)^2} $$
where $D_3 = (R^T R + \epsilon_1 I)^{-1} R^T R (R^T R + \epsilon_1 I)^{-1}$, $E_3 = (R^T R + \epsilon_1 I)^{-1}$, $\xi_i(\alpha) = 1 - \sum_{k=1}^{l_2} \alpha_k h(x_i) E_3 h(x_k)^T$, $D_4 = (S^T S + \epsilon_2 I)^{-1} S^T S (S^T S + \epsilon_2 I)^{-1}$, $E_4 = (S^T S + \epsilon_2 I)^{-1}$ and $\eta_j(\gamma) = 1 - \sum_{m=1}^{l_1} \gamma_m h(x_j) E_4 h(x_m)^T$. In the same way, from the above equations we can obtain $\alpha$ and $\gamma$. When a new sample vector $x$ arrives, we can make predictions according to the following decision function:
$$ f(x) = \arg\min\left\{ \left|h(x)(R^T R + \epsilon_1 I)^{-1} S^T \alpha\right|,\ \left|h(x)(S^T S + \epsilon_2 I)^{-1} R^T \gamma\right| \right\} $$
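The only new ingredient in the nonlinear case is the ELM kernel matrix; the sketch below (random data and a sigmoid feature map, both illustrative assumptions) shows how R and S can be formed before reusing the linear machinery above.

```python
import numpy as np

def elm_kernel(Xa, Xb, W, b):
    """K_ELM(Xa^T, Xb^T) = h(Xa) h(Xb)^T with a shared random feature map h(.)."""
    Ha = 1.0 / (1.0 + np.exp(-(Xa @ W + b)))
    Hb = 1.0 / (1.0 + np.exp(-(Xb @ W + b)))
    return Ha @ Hb.T

# R = elm_kernel(A, G, W, b), S = elm_kernel(B, G, W, b) with G = np.vstack([A, B]);
# these matrices then play the roles of U and V in the linear derivation.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((20, 4)), rng.standard_normal((15, 4))
W, b = rng.standard_normal((4, 30)), rng.standard_normal(30)
G = np.vstack([A, B])
print(elm_kernel(A, G, W, b).shape, elm_kernel(B, G, W, b).shape)   # (20, 35) (15, 35)
```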

3.4. Convergence Analysis

Consider a fixed stage $n$ in Algorithm 1. Let $\omega^* = \arg\min_{\omega} f(\omega)$. Assume that the number of inner iterations $T$ is sufficiently large so that
$$ \eta = \frac{1}{P\tau(1 - 2K\tau)T} + \frac{2K\tau}{1 - 2K\tau} < 1. $$
Then the SVRG iterates converge geometrically in expectation:
$$ \mathbb{E}[f(\tilde{\omega}_n)] \le \mathbb{E}[f(\omega^*)] + \eta\, \mathbb{E}\left[f(\tilde{\omega}_{n-1}) - f(\omega^*)\right] $$
where $K > P > 0$ and $\tau < \frac{1}{K}$. For a detailed proof, see the work of Johnson and Zhang (2013) [25].
Algorithm 1 SVRG for SCTELM
Input: Training data $A \in \mathbb{R}^{l_1 \times n}$ and $B \in \mathbb{R}^{l_2 \times n}$; parameters $\theta$, $C_i\ (i = 1, 2)$ and $\epsilon_i\ (i = 1, 2)$; update frequency $m$; learning rate $\tau$.
Output: $\tilde{\omega}_n$.
Initialize $\tilde{\omega}_0$;
  • FOR n = 1, 2, 3, ⋯, N
  •     $\tilde{\omega} = \tilde{\omega}_{n-1}$;
  •     Calculate $R_{l_1}$ and $R_{l_2}$ according to (34) and (36) for linear SCTELM, or (44) and (46) for nonlinear SCTELM;
  •     FOR t = 1, 2, 3, ⋯, T
  •         Randomly pick $i_t \in \{1, 2, 3, \ldots, l\}$;
  •         Update the weight $\omega_t = \omega_{t-1} - \tau\left(\nabla f_{i_t}(\omega_{t-1}) - \nabla f_{i_t}(\tilde{\omega}) + R_l(\tilde{\omega})\right)$ according to (33) and (35) for linear SCTELM, or (43) and (45) for nonlinear SCTELM;
  •     END
  •     $\tilde{\omega}_n = \omega_m$;
  • END
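The loop structure of Algorithm 1 can be written generically; the sketch below separates the SVRG machinery from the per-sample gradient (which, for SCTELM, would be the gradient sketched in Section 3.2). The parameter defaults and the least-squares sanity check are illustrative assumptions, not part of the paper.

```python
import numpy as np

def svrg(grad_i, n_samples, dim, tau=0.01, N=20, T=None, seed=0):
    """Generic SVRG loop following Algorithm 1.

    grad_i(w, i) must return the gradient of the i-th per-sample objective.
    """
    rng = np.random.default_rng(seed)
    T = T if T is not None else 2 * n_samples
    w_snap = np.zeros(dim)                                     # \tilde{omega}_0
    for _ in range(N):                                         # outer stages
        # full (average) gradient at the snapshot, i.e. R_l in Algorithm 1
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n_samples)], axis=0)
        w = w_snap.copy()
        for _ in range(T):                                     # inner stochastic updates
            i = int(rng.integers(n_samples))
            # variance-reduced step: g_i(w) - g_i(snapshot) + full gradient
            w = w - tau * (grad_i(w, i) - grad_i(w_snap, i) + full_grad)
        w_snap = w                                             # \tilde{omega}_n
    return w_snap

# Sanity check on a simple least-squares objective (illustrative, not SCTELM)
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
w_hat = svrg(lambda w, i: (X[i] @ w - y[i]) * X[i], n_samples=100, dim=5, tau=0.01, N=30)
print("fit error:", np.linalg.norm(w_hat - w_true))            # should shrink as N grows
```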

3.5. Computational Complexity Analysis

In this section, we briefly analyze the complexity of the proposed algorithm. We only analyze linear SCTELM; the nonlinear case is similar. The total complexity is determined by the per-iteration computational cost and the number of iterations. First, consider the former: the computational complexity of $D_1$ and $D_2$ in (33) and (35) is $O(l \times L^2 + L^3)$, and the computational complexity of $E_1$ and $E_2$ is also $O(l \times L^2 + L^3)$, so the overall complexity of (33) and (35) is $O(l \times L^2 + L^3)$. Similarly, the complexity of (34) and (36) is $O(L^3)$. In Algorithm 1, the outer loop is executed $N$ times and the inner loop $T$ times, so the total computational cost of the algorithm is $O(NT(l \times L^2 + L^3))$. Since $L$ is the number of hidden neurons and usually $L \ll l$, the complexity of Algorithm 1 is $O(NTl \times L^2)$. The complexity of SCTELM is therefore mainly determined by the number of samples, and since SVRG converges faster than standard stochastic gradient methods, SCTELM also converges faster than the compared algorithms.
SVRG reduces the variance of the stochastic gradient by introducing an estimate of the global gradient, thus improving the convergence speed. This means that in each iteration, SVRG can use fewer samples (mini-batch) to achieve the same accuracy as traditional SGD. This variance reduction property makes the algorithm more efficient when dealing with large-scale data, thus improving the scalability. Each iteration of SVRG needs to calculate the global gradient, which increases a certain computational overhead compared with only calculating the gradient of local samples. However, due to its faster convergence speed, the same performance can usually be achieved in a smaller number of iterations, thus reducing the overall computation time on large-scale datasets.

4. Numerical Experiments

In this section, to evaluate the robustness of SCTELM, we systematically compare it with other advanced algorithms on multiple datasets, including ELM [2], LELM [26], CTSVM [18], CHELM [27], TELM [24], and FRTELM [28]. For ease of observation, the best result in each table below is indicated in bold. After the dataset tests, we also visualize the classification results to further illustrate the performance of the proposed model. All experiments were performed on a PC running a 64-bit Windows operating system with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.10 GHz and 8.00 GB RAM, and all code was run in MATLAB R2021b.

4.1. Experimental Setup

We can evaluate the performance of an algorithm based on different metrics, but here, we will use accuracy (ACC), which measures the proportion of instances that the model correctly predicted. ACC is defined as follows:
$$ ACC = \frac{TP + TN}{TP + FP + TN + FN} $$
where TP and TN represent true positive and true negative, respectively, which represent the correct prediction of positive and negative samples; FP and FN represent false positive and false negative, respectively, which represent the incorrect prediction of positive and negative samples. Among them, a higher ACC value represents a more accurate classification and a better model performance.
In our experimental design, the training and testing samples were randomly selected from the dataset, and we added 0%, 10%, 20%, and 30% outliers to the training set, respectively, using these contaminated training samples to verify the robustness of the model. Because of the large number of parameters, we used random search to find the best parameters. The parameters of the proposed model are selected as follows: $C_i$ is selected from the set $\{10^i \mid i = -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5\}$, and $\epsilon_i$ is selected from the same set $\{10^i \mid i = -5, \ldots, 5\}$.
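The parameter grid and the random search over it can be written compactly; the sketch below is schematic, and the `evaluate` callback (returning validation accuracy for a candidate setting) is a hypothetical placeholder rather than part of the paper.

```python
import numpy as np

grid = [10.0 ** i for i in range(-5, 6)]       # {10^i | i = -5, ..., 5}
rng = np.random.default_rng(0)

def random_search(evaluate, n_trials=30):
    """Randomly sample (C1, C2, eps1, eps2) from the grid and keep the best setting.

    `evaluate` is a placeholder for a function that trains a model (e.g. SCTELM)
    with the given parameters and returns its cross-validated accuracy.
    """
    best_acc, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {name: grid[rng.integers(len(grid))] for name in ("C1", "C2", "eps1", "eps2")}
        acc = evaluate(**params)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc
```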

4.2. Description of the Datasets

To verify the performance of our proposed SCTELM, we will numerically simulate it using various datasets, including eight benchmark datasets from UCI and two artificial datasets.
Among them, the UCI dataset includes the following: Australian (Australian Credit Approval), Ionosphere, German, Pima, Vote (Congressional Voting Records), WDBC (Breast Cancer Wisconsin), Spect (SPECTF Heart) and QSAR (QSAR biodegradation). We will use these UCI datasets to compare the performance of our proposed algorithm with other algorithms. See Table 1 for relevant information on these datasets.
For the artificial datasets, we generated a two-moons dataset containing two categories, with 1000 positive and 1000 negative samples, and a dataset containing three categories. The two-moons dataset is shown in Figure 3, where the positive and negative classes are represented by green and orange spheres. The three-class dataset contains 900 samples, 300 per class; see Figure 4 for details.
We performed 10-fold cross-validation on the selected datasets; that is, each dataset was randomly split into ten parts, nine of which were used as the training set and the remaining one as the test set. This process was repeated ten times, and the average of the ten results was taken as the final performance metric to reduce the risk of overfitting and underfitting. At the same time, to obtain more objective experimental results, we normalized all datasets so that the data lie in the interval [0, 1]. Since we want to verify the robustness of the model, we successively increase the noise ratio for each dataset; if the classification accuracy does not change much as the noise increases, the algorithm has good robustness.
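For completeness, a schematic version of this protocol (min–max normalization to [0, 1] and a random 10-fold split) is sketched below; the `train_and_score` callback is a hypothetical placeholder for any of the compared classifiers.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature to [0, 1], as done for all datasets in the experiments."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def ten_fold_accuracy(X, y, train_and_score, seed=0):
    """Average accuracy over a random 10-fold split.

    `train_and_score(Xtr, ytr, Xte, yte)` is a placeholder for a classifier's
    train-and-evaluate routine (e.g. SCTELM) returning a test accuracy.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 10)
    scores = []
    for k in range(10):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```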

4.3. Experimental Results on Artificial Datasets

To verify the advantages of the proposed model over the original models, only ELM and TELM are compared in this section. From Table 2, it can be seen that when the noise ratio is zero, the accuracy of SCTELM is higher than that of TELM and ELM, which indicates that SCTELM performs better. When the noise ratio is 0.1, the accuracy of all models decreases, which indicates that noise does affect model performance. When the noise ratio is 0.2, the accuracy of all models decreases slightly, but the accuracy of SCTELM remains higher than that of TELM and ELM. When the noise ratio is 0.3, the accuracy of SCTELM decreases only slightly, while the accuracy of ELM and TELM decreases much more, which clearly illustrates the robustness of SCTELM.
To make a more intuitive comparison, the accuracy of the models under different noise levels is plotted in Figure 5. The curve of SCTELM always lies above those of the other models, which indicates that SCTELM performs better than the other two models. When the noise gradually increases, the accuracy of all models shows a decreasing trend, which indicates that noise does have a large impact on model performance. When the noise rises from 0.2 to 0.3, the accuracy of ELM and TELM drops faster than that of SCTELM, which again demonstrates the superior robustness of the proposed model.
To extend SCTELM to the multi-class case, we adopted an artificial dataset with three classes. Generally speaking, either a one-versus-one or a one-versus-rest strategy is used when applying models such as ELM or TELM to multi-class problems. Here, we use the more common one-versus-one strategy, which decomposes the three-class problem into three binary problems, each responsible for a pair of classes, and determines the final result by majority voting, as sketched below. The final classification results are given in Table 3. As expected, SCTELM outperforms ELM and TELM in the final classification accuracy.
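The voting wrapper referred to above can be sketched as follows; `fit` and `predict` are hypothetical placeholders for any binary learner (such as SCTELM), and only the one-versus-one decomposition and majority voting are illustrated.

```python
import numpy as np
from itertools import combinations

def one_vs_one(X_train, y_train, X_test, fit, predict):
    """One-versus-one multi-class strategy with majority voting.

    `fit(X, y_pm1)` trains a binary classifier on +/-1 labels and returns a model;
    `predict(model, X)` returns +/-1 predictions for the test samples.
    """
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        mask = np.isin(y_train, [classes[a], classes[b]])
        y_pm1 = np.where(y_train[mask] == classes[a], 1, -1)
        model = fit(X_train[mask], y_pm1)
        pred = np.asarray(predict(model, X_test))
        votes[:, a] += (pred == 1)             # a vote for class a
        votes[:, b] += (pred == -1)            # a vote for class b
    return classes[np.argmax(votes, axis=1)]   # class with the most votes wins
```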

4.4. Experimental Results on UCI Datasets

In this section, to validate the classification performance of the algorithms, we ran them on eight UCI datasets. The robustness of SCTELM was preliminarily verified on the artificial datasets above; to verify it further, we add 20% and 30% Gaussian noise to all datasets (see Table 4, Table 5 and Table 6 for details). Setting 20% or 30% noise serves two purposes. On the one hand, it simulates the data contamination that may exist in real environments and allows us to evaluate the model on noisy data; comparing the results under different noise levels gives a clearer picture of the anti-noise capability of SCTELM (for example, if the model still performs well under 30% noise, it is robust to noise). On the other hand, a higher noise level increases the complexity and randomness of the data and enlarges the distribution gap between the training and test data, so whether SCTELM maintains good generalization under heavy noise reflects its ability to adapt to unseen data. In short, this setting is useful for assessing how well the model generalizes to test sets or real-world application data.

4.4.1. Experimental Results on UCI Datasets Without Outliers

From Table 4, we can clearly see that the classification performance of the proposed SCTELM is better than that of the other six algorithms on almost all datasets. In addition, we find that TELM and CTSVM are faster than the other algorithms on most datasets; this is because they solve a pair of smaller QPPs instead of one large QPP. Our proposed algorithm inherits this advantage: Table 4 shows that the learning efficiency of SCTELM is ahead of the other algorithms on most datasets.

4.4.2. Experimental Results on UCI Datasets with Outliers

The previous experiments verified the superiority of SCTELM. To further verify its robustness, we added a certain proportion of outliers to each dataset and conducted new experiments, choosing 20% and 30% Gaussian noise, respectively. Table 5 and Table 6 present the experimental results of the seven models with 20% and 30% noise. From these tables, we can see that the accuracy of every model decreases after noise is introduced, which indicates that noise pollution indeed has a great impact on classification. At the same time, the accuracy of SCTELM remains the highest, so it can be concluded that the robustness of the proposed SCTELM is significantly better than that of the other models.
To make this more intuitive, we plot Figure 6, Figure 7, Figure 8 and Figure 9 according to the noise ratio and accuracy. From these figures, we can clearly see that the accuracy decreases as the noise increases, and we can also see that the "slope" of the SCTELM curve is small, which further shows that SCTELM has strong robustness.

4.5. Statistical Analysis

In this section, to analyze the significant differences among the seven algorithms on these eight UCI datasets, we conducted the Friedman test [29], a non-parametric test with weak assumptions on the data distribution and easily interpretable results. The null hypothesis $H_0$ is that there is no significant difference between the seven algorithms; if the null hypothesis is rejected, the Nemenyi post hoc test is performed [29]. The average rankings and average accuracies of the seven algorithms on the eight datasets are shown in Table 4, Table 5 and Table 6.
First, we make a statistical comparison for the noise-free case. According to the Friedman statistic formula,
$$ \chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right] = 35.57 $$
where $N$ is the number of datasets, $k$ is the number of algorithms, and $R_j$ is the average ranking of the $j$th algorithm over all datasets. In this paper, $k = 7$ and $N = 8$.
Second, we compute the $F_F$ statistic, which follows an F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom:
$$ F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2} = 20.03 $$
Similarly, the $\chi_F^2$ values for noise levels of 0.2 and 0.3 are 32.41 and 40.07, respectively, and the corresponding $F_F$ values are 9.12 and 20.48. For $\alpha = 0.05$, we have $F_{\alpha}(6, 42) = 2.32$. Clearly, every $F_F$ satisfies $F_F > F_{\alpha}$, so there are significant differences among the seven algorithms.
To compare the seven algorithms further, the Nemenyi post hoc test is performed. Referring to the table of critical values, we obtain $q_{\alpha=0.05} = 2.85$, which gives the critical difference (CD):
$$ CD = q_{\alpha=0.05}\sqrt{\frac{k(k+1)}{6N}} = 2.85 \times \sqrt{\frac{7 \times 8}{6 \times 8}} = 3.078 $$
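These statistics can be reproduced directly from the average ranks; the short script below does so (small deviations from the reported χ²_F = 35.57 and F_F = 20.03 are due to rounding in the published average ranks).

```python
import numpy as np

# Average ranks of the seven algorithms from Table 4 (noise-free case)
ranks = np.array([6.75, 4.00, 4.45, 4.88, 4.63, 2.13, 1.13])
k, N = 7, 8                                    # algorithms, datasets

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(ranks ** 2) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
CD = 2.85 * np.sqrt(k * (k + 1) / (6 * N))     # q_{0.05} = 2.85 for seven classifiers

print(f"chi2_F = {chi2_F:.2f}, F_F = {F_F:.2f}, CD = {CD:.3f}")
print("rank differences vs. SCTELM:", np.round(ranks[:-1] - ranks[-1], 2))
```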
If the difference between the average rankings of two algorithms is larger than the critical difference, the two algorithms differ significantly. From Table 4, we can compute the differences between SCTELM and the other six algorithms in the noise-free case as follows:
$$ \begin{aligned} D(ELM - SCTELM) &= 6.75 - 1.13 = 5.62 > 3.078 \\ D(TELM - SCTELM) &= 4 - 1.13 = 2.87 < 3.078 \\ D(LELM - SCTELM) &= 4.5 - 1.13 = 3.37 > 3.078 \\ D(CTSVM - SCTELM) &= 4.88 - 1.13 = 3.75 > 3.078 \\ D(CHELM - SCTELM) &= 4.63 - 1.13 = 3.5 > 3.078 \\ D(FRTELM - SCTELM) &= 2.13 - 1.13 = 1 < 3.078 \end{aligned} $$
The notation D(A − B) represents the difference in average rankings between the two algorithms. We can conclude that SCTELM significantly outperforms ELM, LELM, CTSVM and CHELM on the noise-free datasets, while showing no significant difference compared with TELM and FRTELM. According to Table 5 and Table 6, on the datasets with 20% and 30% Gaussian noise, SCTELM shows a significant difference compared with ELM, TELM and LELM, but little difference compared with CTSVM and CHELM. To make these conclusions clear, the post hoc test is visualized in Figure 10.
In Figure 10, the abscissa represents the average ranking of each algorithm, and the ordinate represents the algorithms to be compared. If the intervals of two algorithms overlap along the abscissa, there is no significant difference between them; Figure 10 confirms the conclusions above.

4.6. Parameter Analysis

To make it easier for other researchers to use SCTELM, we further study its performance under different parameters. We chose four representative datasets (Australian, German, Spect and Ionosphere) and analyzed the sensitivity of the proposed algorithm to the parameters C1, C2 and θ, keeping the other parameters fixed in each experiment. The experimental results are shown in Figure 11 and Figure 12.
For the parameters C1 and C2, we observe that the ACC values fluctuate considerably as the parameters change, and the fluctuations are particularly significant on some datasets. For example, on the German dataset, changing C1 and C2 causes the ACC value to fluctuate by close to 60%, which indicates that this dataset is very sensitive to these parameters; in contrast, the fluctuations on datasets such as Australian are relatively small, about 30%. Further analysis shows that the ACC value rises significantly as C1 decreases and reaches its maximum at C1 = $10^{-4}$, indicating that a smaller C1 is more beneficial to the classification accuracy; C1 should therefore be set to a small value. On the other hand, the effect of C2 shows a roughly linear trend: as C2 increases, the ACC value gradually rises, and the best classification performance is achieved at C2 = $10^{4}$. This indicates that appropriately increasing C2 helps to further improve the classification performance of the algorithm.
Compared with C1 and C2, the parameter θ is relatively stable in the experiments, especially on the Australian dataset, where its maximum fluctuation is less than 2%. However, on the Ionosphere dataset, θ causes fluctuations of nearly 10%, indicating that some sensitivity remains on certain datasets. Further analysis shows that the influence of θ on the classification accuracy is nonlinear; the algorithm performs best when θ is set within $[10^{-3}, 10^{3}]$, indicating that values in this range effectively optimize the classification performance.

4.7. Convergence Analysis Experiments

To verify the convergence of the proposed SCTELM algorithm, experiments are carried out on the Australian, German, Spect and Ionosphere datasets in this section. The setup is similar to that of Section 4.4.2, with the relevant parameters set to their optimal values. The experimental results are shown in Figure 13.
In Figure 13, it is easy to see that the objective value of SCTELM decreases steadily as the number of iterations increases and converges to a stable value after a finite number of iterations, so the proposed SCTELM algorithm converges within a finite number of iterations.

5. Conclusions

This paper proposes a novel robust twin extreme learning machine that employs a newly designed soft capped L1-norm loss function, designated SCTELM for brevity. In contrast to the conventional hard capped methodology, the proposed soft capped approach effectively circumvents the potential non-differentiability issues associated with hard capping, whilst retaining its beneficial aspects. The objective is to mitigate the impact of outliers on the model, thereby increasing its robustness. Furthermore, SCTELM only requires the solution of a pair of relatively small quadratic programming problems (QPPs), which significantly reduces the complexity associated with solving one large QPP and consequently enhances the learning efficiency of the model. The proposed model demonstrates enhanced stability and increased computational efficiency in the presence of data anomalies. The experimental results on noise-free and noisy datasets indicate that SCTELM exhibits heightened robustness compared to traditional methods. This is particularly evident on datasets containing outliers, where the accuracy curve of SCTELM declines more gradually and its learning efficiency is superior to that of the other methods.
Our research is of considerable importance within the academic field. This paper introduces a novel loss function based on soft capped, which enables the construction of a robust twin extreme learning machine. This provides a new perspective and methodology for the field of machine learning and has the potential to encourage subsequent researchers to pursue innovative avenues in the design of loss functions and learning algorithms, thereby promoting the further advancement of both theoretical and practical developments. Furthermore, in the context of real-world data, such as the credit card approval dataset from the University of California, Irvine (UCI), users can be classified into two distinct categories: those who are approved for credit and those who are not. It is evident that SCTELM has gained significant advantages in terms of robustness to outliers in comparison to alternative algorithms. To illustrate, WDBC represents the Diagnostic Wisconsin Breast Cancer Database, and the SCTELM approach has also yielded significant benefits. It follows, therefore, that our research has significant practical implications for the fields of finance and medical treatment. The model is capable of facilitating the provision of real-time analysis and decision support to end users. The enhancement of accuracy and stability in the model enables enterprises and healthcare facilities to effectively address decision-making challenges in intricate contexts. The model may be utilized in a multitude of domains, including, although not limited to, image processing, signal processing, and natural language processing. These fields often have to contend with data noise and uncertainty, which robust models can help to overcome, thereby improving the accuracy and reliability of results. In conclusion, our study offers a significant contribution to the field of machine learning, providing a valuable perspective on robustness research and its potential for practical applications, which will inform future research and developments.
It should be noted that the algorithm is not without its limitations. At the commencement of each cycle, the SVRG algorithm is required to calculate the full gradient of the entire dataset. Although this will facilitate convergence more rapidly than the conventional SGD algorithm, the associated computational cost may be considerable in the context of large-scale datasets, potentially leading to a reduction in the rate of algorithmic execution. In future work, it would be worthwhile to consider employing periodic random sampling in place of calculating the full gradient each time, or to investigate combining the SVRG algorithm with an adaptive learning rate optimization algorithm. This would allow the learning rate to be adaptively changed in accordance with the gradient.

Author Contributions

Conceptualization, Z.X. and J.M.; methodology, Z.X. and J.M.; software, Z.X.; validation, Z.X. and B.W.; resources, G.Y. and B.W.; data curation, G.Y.; writing—original draft preparation, Z.X.; writing—review and editing, B.W., J.M. and G.Y.; visualization, Z.X., B.W. and G.Y.; supervision, G.Y. and B.W.; project administration, G.Y., B.W. and J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Ningxia Province (No. 2024AAC05055, No. 2023AAC02053), in part by the Fundamental Research Funds for the Central Universities of North Minzu University (No. 2023ZRLG01, No. 2021JCYJ07), in part by the National Natural Science Foundation of China (No. 62366001, No. 12361062), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), and in part by the Postgraduate Innovation Project of North Minzu University (YCX24280).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alzaqebah, A.; Aljarah, I.; Al-Kadi, O. A hierarchical intrusion detection system based on extreme learning machine and nature-inspired optimization. Comput. Secur. 2023, 124, 102957. [Google Scholar] [CrossRef]
  2. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 25–29 July 2004; pp. 985–990. [Google Scholar]
  3. Zheng, G.; Hua, W.; Qiu, Z.; Gong, Z. Detecting Water Depth from Remotely Sensed Imagery Based on ELM and GA-ELM. J. Indian Soc. Remote Sens. 2021, 49, 947–957. [Google Scholar] [CrossRef]
  4. Yang, H.; Zhao, Y.; Zhao, Y.; Chen, N. Drivers’ visual interaction performance of on-board computer under different heat conditions: Based on ELM and entropy weight. Sustain. Cities Soc. 2022, 81, 103835. [Google Scholar] [CrossRef]
  5. Wang, H.; Li, C.; Guan, T.; Zhao, S. No-reference stereoscopic image quality assessment using quaternion wavelet transform and heterogeneous ensemble learning. Displays 2021, 69, 102058. [Google Scholar] [CrossRef]
  6. Zhan, W.; Wang, K.; Cao, J. Elastic-net based robust extreme learning machine for one-class classification. Signal Process. 2023, 211, 109101. [Google Scholar] [CrossRef]
  7. Wu, Q.; Wang, F.; An, Y.; Li, K. L1-Norm Robust Regularized Extreme Learning Machine with Asymmetric C-Loss for Regression. Axioms 2023, 12, 204. [Google Scholar] [CrossRef]
  8. Lu, X.; Ming, L.; Liu, W.; Li, H.X. Probabilistic regularized extreme learning machine for robust modeling of noise data. IEEE Trans. Cybern. 2017, 48, 2368–2377. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Wu, Q.; Hu, J. An adaptive learning algorithm for regularized extreme learning machine. IEEE Access 2021, 9, 20736–20745. [Google Scholar] [CrossRef]
  10. Chen, K.; Lv, Q.; Lu, Y.; Dou, Y. Robust regularized extreme learning machine for regression using iteratively reweighted least squares. Neurocomputing 2017, 230, 345–358. [Google Scholar] [CrossRef]
  11. Ren, Z.; Yang, L. Robust extreme learning machines with different loss functions. Neural Process. Lett. 2019, 49, 1543–1565. [Google Scholar] [CrossRef]
  12. Luo, L.; Wang, K.; Lin, Q. Robust Extreme Learning Machine Based on p-order Laplace Kernel-Induced Loss Function. Int. J. Adv. Comput. Sci. Appl. 2024, 14, 1281–1291. [Google Scholar] [CrossRef]
  13. Meng, D.; Zhao, Q.; Xu, Z. Improve robustness of sparse PCA by L1-norm maximization. Pattern Recognit. 2012, 45, 487–497. [Google Scholar] [CrossRef]
  14. Zhang, K.; Luo, M. Outlier-robust extreme learning machine for regression problems. Neurocomputing 2015, 151, 1519–1527. [Google Scholar] [CrossRef]
  15. Jiang, W.; Nie, F.; Huang, H. Robust dictionary learning with capped L1-norm. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3590–3596. [Google Scholar]
  16. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59. [Google Scholar] [CrossRef]
  17. Chen, M.; Wang, Q.; Chen, S.; Li, X. Capped L1-norm sparse representation method for graph clustering. IEEE Access 2019, 7, 54464–54471. [Google Scholar] [CrossRef]
  18. Li, Y.; Sun, H.; Yan, W.; Cui, Q. R-CTSVM+: Robust capped L1-norm twin support vector machine with privileged information. Inf. Sci. 2021, 574, 12–32. [Google Scholar] [CrossRef]
  19. Nie, F.; Wang, X.; Huang, H. Multiclass capped Lp-Norm SVM for robust classifications. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2415–2421. [Google Scholar]
  20. Jayadeva; Khemchandani, R.; Chandra, S. Twin Support Vector Machines for Pattern Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910. [Google Scholar] [CrossRef] [PubMed]
  21. Ding, S.; Hua, X.; Yu, J. An overview on nonparallel hyperplane support vector machine algorithms. Neural Comput. Appl. 2014, 25, 975–982. [Google Scholar] [CrossRef]
  22. Kumar, M.A.; Gopal, M. Least squares twin support vector machines for pattern classification. Expert Syst. Appl. 2009, 36, 7535–7543. [Google Scholar] [CrossRef]
  23. Shao, Y.H.; Zhang, C.H.; Wang, X.B.; Deng, N.Y. Improvements on twin support vector machines. IEEE Trans. Neural Netw. 2011, 22, 962–968. [Google Scholar] [CrossRef]
  24. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244. [Google Scholar] [CrossRef]
  25. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 1, pp. 315–323. [Google Scholar]
  26. Ma, J.; Wen, Y.; Yang, L. Lagrangian supervised and semi-supervised extreme learning machine. Appl. Intell. 2019, 49, 303–318. [Google Scholar] [CrossRef]
  27. Ren, Z.; Yang, L. Correntropy-based robust extreme learning machine for classification. Neurocomputing 2018, 313, 74–84. [Google Scholar] [CrossRef]
  28. Ma, J.; Yang, L.; Sun, Q. Capped L1-norm distance metric-based fast robust twin bounded support vector machine. Neurocomputing 2020, 412, 295–311. [Google Scholar] [CrossRef]
  29. Demšar, J.; Schuurmans, D. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. ELM algorithm diagram.
Figure 2. Capped $L_\theta$-loss ($\theta = 0.5, 3, 10$) versus the hard capped L1-norm loss ($\epsilon = 1$).
Figure 3. Two moons dataset distribution.
Figure 4. Three-class dataset distribution.
Figure 5. Comparison of algorithms on artificial datasets.
Figure 6. The accuracy of the six algorithms under different noises: (a) Australian datasets; (b) WDBC datasets.
Figure 7. The accuracy of the six algorithms under different noises: (a) QSAR datasets; (b) German datasets.
Figure 8. The accuracy of the six algorithms under different noises: (a) Vote datasets; (b) Spect datasets.
Figure 9. The accuracy of the six algorithms under different noises: (a) Ionosphere datasets; (b) Pima datasets.
Figure 10. Visualization of critical difference: (a) without noise; (b) with 20% Gaussian noise; and (c) with 30% Gaussian noise.
Figure 11. Histogram of sensitivity of parameters C1, C2 on four datasets: (a) Australian; (b) German; (c) Spect; (d) Ionosphere.
Figure 12. Histogram of sensitivity of parameter θ on four datasets: (a) Australian; (b) German; (c) Spect; (d) Ionosphere.
Figure 13. Convergence analysis of SCTELM on four UCI datasets: (a) Australian; (b) German; (c) Spect; (d) Ionosphere.
Table 1. Characteristics of the UCI datasets.

Datasets | Samples | Attributes | Positive Samples | Negative Samples
Australian | 690 | 14 | 307 | 383
VOTE | 432 | 16 | 167 | 265
WDBC | 569 | 30 | 357 | 212
Spect | 267 | 44 | 212 | 55
QSAR | 1055 | 41 | 699 | 356
Ionosphere | 351 | 34 | 225 | 126
German | 1000 | 24 | 700 | 300
Pima | 768 | 8 | 500 | 268
Table 2. Performance on artificial datasets.

Noise Ratio | ELM ACC (%) | TELM ACC (%) | SCTELM ACC (%)
0 | 92.53 | 93.82 | 95.91
0.1 | 90.16 | 90.50 | 94.64
0.2 | 85.31 | 86.73 | 91.36
0.3 | 76.64 | 77.55 | 83.45
Table 3. Experimental results on the multi-class artificial dataset.

 | ELM ACC (%) | TELM ACC (%) | SCTELM ACC (%)
Class 1 | 63.74 | 77.83 | 85.17
Class 2 | 79.31 | 71.17 | 84.33
Class 3 | 72.83 | 73.00 | 83.17
Final result | 71.85 | 74.00 | 84.22
Table 4. Experimental results on UCI datasets without noise (each cell: ACC (%) / time (s)).

Datasets | ELM | TELM | LELM | CTSVM | CHELM | FRTELM | SCTELM
Australian | 84.37 / 1.431 | 85.71 / 1.983 | 84.57 / 0.793 | 85.64 / 0.324 | 84.93 / 3.368 | 85.73 / 1.173 | 86.32 / 0.601
WDBC | 95.31 / 1.3546 | 97.32 / 0.924 | 96.54 / 0.734 | 97.31 / 0.847 | 96.86 / 2.736 | 97.67 / 0.514 | 98.21 / 0.525
QSAR | 83.94 / 1.374 | 85.63 / 0.561 | 85.31 / 1.351 | 85.19 / 1.169 | 86.17 / 3.483 | 86.37 / 0.437 | 86.51 / 1.659
German | 76.59 / 1.709 | 77.80 / 1.093 | 78.62 / 2.018 | 76.57 / 1.687 | 76.88 / 5.021 | 77.63 / 0.729 | 78.55 / 1.486
Vote | 95.19 / 1.244 | 95.52 / 0.821 | 96.35 / 0.465 | 95.51 / 0.656 | 96.47 / 4.076 | 97.03 / 0.519 | 97.62 / 0.339
Spect | 81.37 / 0.770 | 82.71 / 0.831 | 82.68 / 0.769 | 82.25 / 1.132 | 81.96 / 1.982 | 82.94 / 0.492 | 83.62 / 0.250
Ionosphere | 86.33 / 0.589 | 88.00 / 0.361 | 90.67 / 0.363 | 88.60 / 0.399 | 87.65 / 2.561 | 91.08 / 0.229 | 91.37 / 0.264
Pima | 76.12 / 1.854 | 76.31 / 1.008 | 75.97 / 1.105 | 76.92 / 1.317 | 76.78 / 3.561 | 77.13 / 1.347 | 77.83 / 0.539
Avg. ACC | 84.90 | 86.13 | 86.34 | 86.00 | 85.96 | 87.07 | 87.50
Avg. rank | 6.75 | 4.00 | 4.45 | 4.88 | 4.63 | 2.13 | 1.13
Table 5. Experimental results on UCI datasets with 20% Gaussian noise (each cell: ACC (%) / time (s)).

Datasets | ELM | TELM | LELM | CTSVM | CHELM | FRTELM | SCTELM
Australian | 75.74 / 1.371 | 76.34 / 0.983 | 75.97 / 0.893 | 77.49 / 0.781 | 78.53 / 6.368 | 79.81 / 1.469 | 81.88 / 0.330
WDBC | 85.49 / 1.478 | 83.72 / 1.124 | 84.14 / 1.715 | 86.72 / 1.195 | 86.95 / 8.969 | 89.74 / 1.649 | 92.86 / 0.791
QSAR | 64.73 / 2.314 | 66.81 / 4.539 | 65.29 / 2.059 | 65.73 / 2.994 | 65.29 / 10.237 | 69.52 / 2.784 | 73.29 / 1.728
German | 68.93 / 2.539 | 67.38 / 2.348 | 68.19 / 1.818 | 68.71 / 1.434 | 69.71 / 4.320 | 71.37 / 1.647 | 72.30 / 0.965
Vote | 91.72 / 0.909 | 93.57 / 0.495 | 91.29 / 0.965 | 91.03 / 0.938 | 94.27 / 3.867 | 94.54 / 1.652 | 95.86 / 0.439
Spect | 70.09 / 0.759 | 72.49 / 1.095 | 70.18 / 0.819 | 80.63 / 1.151 | 80.34 / 5.184 | 80.63 / 1.525 | 81.77 / 0.375
Ionosphere | 78.37 / 0.889 | 81.54 / 2.539 | 79.74 / 0.751 | 79.35 / 0.804 | 79.52 / 4.264 | 83.39 / 1.257 | 83.82 / 0.350
Pima | 65.71 / 1.804 | 65.40 / 0.943 | 66.74 / 0.905 | 71.74 / 1.032 | 72.78 / 2.553 | 72.37 / 1.561 | 73.34 / 0.711
Avg. ACC | 75.10 | 75.91 | 75.19 | 77.68 | 78.42 | 80.25 | 81.89
Avg. rank | 6.00 | 5.13 | 5.50 | 4.50 | 3.63 | 2.00 | 1.00
Table 6. Experimental results on UCI datasets with 30% Gaussian noise (each cell: ACC (%) / time (s)).

Datasets | ELM | TELM | LELM | CTSVM | CHELM | FRTELM | SCTELM
Australian | 64.81 / 1.746 | 63.62 / 1.493 | 63.94 / 1.513 | 67.32 / 1.641 | 68.01 / 6.468 | 74.84 / 1.426 | 75.74 / 0.494
WDBC | 62.53 / 1.71 | 63.03 / 1.548 | 63.53 / 1.585 | 74.83 / 1.835 | 74.57 / 8.361 | 79.41 / 1.457 | 80.36 / 1.451
QSAR | 62.79 / 2.856 | 63.72 / 4.792 | 62.53 / 3.283 | 66.36 / 2.186 | 66.76 / 10.593 | 69.36 / 3.429 | 70.04 / 1.994
German | 67.29 / 3.094 | 67.98 / 2.429 | 69.52 / 2.043 | 62.93 / 3.184 | 68.71 / 4.302 | 69.37 / 1.827 | 69.80 / 1.029
Vote | 91.61 / 0.749 | 90.78 / 0.765 | 92.33 / 0.649 | 91.97 / 1.098 | 93.67 / 3.625 | 93.91 / 1.372 | 94.10 / 0.372
Spect | 62.92 / 0.963 | 65.97 / 0.961 | 64.42 / 0.913 | 78.34 / 1.241 | 78.57 / 4.215 | 78.87 / 1.500 | 79.54 / 0.321
Ionosphere | 72.72 / 0.859 | 67.04 / 0.979 | 72.79 / 0.811 | 75.05 / 0.972 | 74.01 / 3.683 | 75.19 / 1.017 | 75.59 / 0.350
Pima | 63.91 / 1.673 | 64.22 / 1.263 | 62.94 / 0.734 | 67.88 / 0.371 | 68.58 / 2.102 | 70.63 / 1.527 | 72.11 / 0.729
Avg. ACC | 68.57 | 68.30 | 69.00 | 73.09 | 74.11 | 76.47 | 77.16
Avg. rank | 6.13 | 5.88 | 5.38 | 4.25 | 3.38 | 2.00 | 1.00
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
