Article

Efficient Optimization of a Support Vector Regression Model with Natural Logarithm of the Hyperbolic Cosine Loss Function for Broader Noise Distribution

Department of Electrical and Energy, Dokuz Eylul University, 35380 Izmir, Turkey
Appl. Sci. 2024, 14(9), 3641; https://doi.org/10.3390/app14093641
Submission received: 25 March 2024 / Revised: 21 April 2024 / Accepted: 22 April 2024 / Published: 25 April 2024
(This article belongs to the Topic Advances in Artificial Neural Networks)

Abstract

While traditional support vector regression (SVR) models rely on loss functions tailored to specific noise distributions, this research explores an alternative approach: ε-ln SVR, which uses a loss function based on the natural logarithm of the hyperbolic cosine function (lncosh). This function exhibits optimality for a broader family of noise distributions known as power-raised hyperbolic secants (PHSs). We derive the dual formulation of the ε-ln SVR model, which reveals a nonsmooth, nonlinear convex optimization problem. To efficiently overcome these complexities, we propose a novel sequential minimal optimization (SMO)-like algorithm with an innovative working set selection (WSS) procedure. This procedure exploits second-order (SO)-like information by minimizing an upper bound on the second-order Taylor polynomial approximation of consecutive loss function values. Experimental results on benchmark datasets demonstrate the effectiveness of both the ε-ln SVR model with its lncosh loss and the proposed SMO-like algorithm with its computationally efficient WSS procedure. This study provides a promising tool for scenarios with different noise distributions, extending beyond the commonly assumed Gaussian to the broader PHS family.

1. Introduction

Support vector regression [1,2] is an extension of the Support Vector Machine (SVM), which was initially introduced to solve classification problems [3,4]. In SVR, the goal is to find a function that best fits the data, providing good generalization as well as robustness against noise by considering a regularized ε-insensitive loss function. SVR typically solves the Lagrangian dual problem with 2L Lagrange multipliers, twice the number of training samples, taking advantage of the kernel trick that makes it possible to implicitly transform the input data into a higher dimensional space.
The landscape of SVM/SVR optimization methods is rich and diverse. Cutting plane algorithms with improved line search techniques have been used to solve nonsmooth convex linear SVM problems in the primal [5,6]. Proximal gradient-based methods have found success in solving linearly constrained Quadratic Programming (QP) problems with box constraints, applicable to Huberized SVMs and various SVM variants [7,8]. Both subgradient- and gradient-based methods have been explored for solving strongly convex optimization problems, with numerical experiments conducted on soft-margin linear SVMs [9]. Recent studies have reformulated SVR with the l 2 loss function, leading to nonsmooth unconstrained dual problems with L Lagrange multipliers. These studies investigated various approaches, including solving smooth approximations, applying generalized derivatives, and employing functional iterative and Newton methods [10,11,12]. Wang et al. [13] addressed nonsmooth dual problems in nonparallel support vector ordinal regression using an alternating direction method of multipliers, requiring kernel computations at each iteration. Yin and Li [14] introduced a semismooth Newton method for solving both support vector classification and regression problems with the l 2 loss function in the primal.
As the number of training samples in SVM/SVR grows, computational efficiency becomes a major challenge due to the appearance of the massive kernel matrix. To address this issue, the SMO algorithm is used to decompose the problem into smaller subproblems where only two Lagrange multipliers are updated in each iteration. Originally developed by Platt [15] for smooth dual QP problems with 2 L Lagrange multipliers, the SMO algorithm has undergone numerous developments. Keerthi et al. [16] and Fan et al. [17] introduced first-order and second-order information, respectively, into the WSS procedure, a key component of SMO. Flake and Lawrence [18] introduced an SMO algorithm for solving nonsmooth, indeed piecewise quadratic, optimization problems by dealing with L optimization parameters. Other studies, such as Guo et al. [19] and Takahashi et al. [20] extended this approach by using first-order (FO) information for WSS. Additionally, Kocaoğlu [21,22] further extended the WSS procedure by integrating SO-like information. This extension involved the innovative concept of minimizing an upper bound on the difference between consecutive loss function values, effectively addressing the challenges of solving the piecewise quadratic dual optimization problem. In [23], a WSS procedure was also developed by combining the advantages of the methods in [19,20] based on the FO-like information and the method in [24] for solving piecewise quadratic problem arising in nonparallel SVR. In [25], an SMO algorithm for solving QP problems that arise in LSSVM was developed, taking advantage of handling L variables. This algorithm utilizes the WSS procedure with FO information. Later, in [26], the SMO algorithm for LSSVM was extended by comparing the performance of WSS procedures that employ both first-order and second-order information. In particular, studies [17,21,26] have consistently demonstrated the advantages of SO-based WSS over FO-based approaches in terms of efficiency. More recently, the SMO algorithm for solving QP problems is further improved by the studies [27,28,29].
In several real-world problems, the noise exhibits different distributions rather than a specific distribution such as Gaussian. Thus, beyond traditional l 1 and l 2 loss functions, a diverse landscape of alternatives has emerged, each tailored to specific noise distributions [30,31,32,33,34,35,36,37,38,39,40]. In [30], a novel variant of SVM was introduced, where the traditional hinge loss in SVM was replaced with the pinball loss. The dual QP problem with box constraints was subsequently solved using the SMO algorithm in [31]. Ref. [32] extended it with the squared pinball loss, resulting in an asymmetric least squares SVM, and [33] employs this loss function for SVR with a SMO-based solver. In [34], SVR models with asymmetric Huber and ε -insensitive Huber loss functions were presented, leading to strongly convex minimization problems which are solved in the primal by a functional iterative method. The classical ridge regression assumes that the noise follows a Gaussian distribution. However, Ref. [35] revealed that, in certain practical applications, such as wind speed prediction, the noise models may not adhere to a Gaussian distribution. So, in [35], a nonlinear loss function optimal to Beta noise distribution in the maximum likelihood sense was employed for wind speed prediction and the kernel ridge regression with this nonlinear loss was solved by the Augmented Lagrangian Multiplier method. In [36], SVR was formulated with a loss function determined based on the noise distribution in such a way that the optimal loss functions were determined in the maximum likelihood sense for Laplace, Gaussian, Beta, Weibull and Marshall–Olkin generalized exponential distributions. A naive online R minimization algorithm was chosen as the optimization method to solve this dual nonlinear SVR and it was reported that SVR with a loss determined based on noise distribution performs better than classical ε -SVR. Ref. [37] proposed a LSSVM and an extreme learning machine with a homotopy loss possessing two tunable parameters, which covers different loss functions such as l 1 -norm loss, logarithmic loss, Geman–Reynolds loss, Geman–McClure loss and correntropy-based loss. Although the proposed loss covers these above-mentioned losses, only the problem of LSSVM with homotopy loss, which becomes equivalent to the reweighted LSSVM model for some specific values of one tunable parameter, was solved via reweighted least squares algorithm. Recently, a convex piecewise linear loss function, namely the ε -penalty loss function, with two tunable parameters, where the popular ε -insensitive l 1 loss function and the Laplace loss function are particular cases of this loss function, was introduced in [38], and resulting QP and linear programming problems of the SVR models with this loss function were solved by the interior point algorithm. Another convex, continuous and differentiable loss function, namely l s loss, was presented in [39] and used to construct two kernel-based regressors for improved noise robustness. The l s loss was used in place of the traditional loss function in LSSVR and ELM. An iteratively reweighted least squares method was utilized to optimize these LSSVR and ELM problems. Another study [40] introduced an SVR model with a continuously differentiable convex loss function, namely lncosh loss, which is optimal in the maximum likelihood sense for the hyper-secant error distribution. 
This loss function has been applied in various fields [41,42,43,44,45,46,47,48], and it was noted in [40] that SVR models generated using various parameter settings of the lncosh loss exhibit many of the favorable attributes found in well-known loss functions such as Vapnik's loss, the squared loss and Huber's loss. The solution of the convex problem of ε-ln SVR with 2L optimization parameters is obtained by an interior point algorithm. However, as the amount of training data increases, the interior point method becomes inefficient. Overall, the emergence of diverse loss functions offers exciting opportunities for SVR to adapt to real-world noise and potentially outperform classical approaches. Further research into efficient optimization methods for these promising newcomers is essential to fully exploit their potential.
In this paper, we present a novel approach to SVR that effectively addresses noise distribution diversity and computational efficiency. First, we formulate a primal SVR problem with a modified ε-insensitive lncosh loss function, namely ε-ln SVR, by using equality constraints, and we derive a nonsmooth convex dual problem with the compelling advantage of requiring only L optimization parameters, effectively halving the number compared to the previous approach [40]. Secondly, we propose an efficient SMO-like algorithm with a novel and computationally efficient WSS procedure. This algorithm strategically selects the two updated parameters associated with the argument that minimizes the upper bound of a second-order Taylor polynomial approximation of consecutive loss function values, enabling the exploitation of SO-like information for solving the nonsmooth dual problem. The modified lncosh loss function, characterized by a single tunable parameter, is defined as
$$l_{\varepsilon}(x;\eta_{1})=\begin{cases}0, & \text{if } |x|<\varepsilon\\ \frac{1}{\eta_{1}}\ln\cosh\big(\eta_{2}(|x|-\varepsilon)\big), & \text{otherwise}\end{cases}$$
with $\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}$, where $\psi_{1}(\cdot)$ is the well-known trigamma function. It holds the distinction of being optimal in the maximum likelihood sense for the family of PHS distributions, which encompasses the Laplace, Gaussian and hyperbolic secant distributions as special cases [49]. Notably, this lncosh loss function with ε-insensitivity becomes equivalent to Vapnik's loss and the ε-insensitive l2 loss functions for the limit values of this tunable parameter, demonstrating its remarkable adaptability.
Evaluation on benchmark datasets demonstrates that ε-ln SVR yields better test performance than the state-of-the-art SVR models ε-SVR and ε-l2 SVR for optimal values of the hyperparameters. Moreover, our proposed SMO-like algorithm, equipped with a novel and computationally efficient WSS procedure utilizing SO-like information, exhibits remarkable efficacy in solving the nonsmooth dual problem of ε-ln SVR with only L optimization variables. It outperforms both its counterpart relying on FO information and the smooth counterpart with 2L optimization parameters.
The outline of the paper is as follows. Section 2 introduces the ε-ln SVR problem and its smooth and nonsmooth dual formulations, and shows the influence of the tunable parameter on the loss function and its corresponding noise distributions. Section 3 describes the proposed SMO-like algorithm with a novel and computationally efficient WSS procedure specifically designed to overcome the nonsmooth nonlinear dual problem of ε-ln SVR. Section 4 presents the results achieved on several real-world benchmark datasets, and Section 5 discusses both the results and future directions.

2. ε-ln SVR and Its Dual Problem

In this section, an overview of the smooth dual formulation of the ε-ln SVR problem is presented and the nonsmooth dual formulation is derived. The primal problem of ε-ln SVR can be expressed in two ways: as a regularized loss with inequality constraints (1) and as a regularized ε-insensitive loss with equality constraints (8). While these two formulations are interchangeable, their Lagrangian dual problems differ. The dual of (1) is smooth and has 2L Lagrange multipliers, as in (7). The dual of (8) is nonsmooth but has the advantage of involving only L Lagrange multipliers, as obtained in (14).

2.1. The Smooth Dual Problem of ε-ln SVR

Building on [40], and stating the primal problem more precisely by eliminating unnecessary constraints and making the ε-insensitivity of the loss function explicit, the primal problem (1) of ε-ln SVR with inequality constraints is obtained as follows:
$$\min_{\mathbf{w}\in\mathbb{R}^{n},\,b\in\mathbb{R}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{s=1}^{L}\big(l(\xi_{s})+l(\xi_{s}^{*})\big)\quad\text{subject to}\quad y_{s}-\mathbf{w}^{T}\varphi(\mathbf{x}_{s})-b\le\xi_{s}+\varepsilon,\quad \mathbf{w}^{T}\varphi(\mathbf{x}_{s})+b-y_{s}\le\xi_{s}^{*}+\varepsilon,\quad s\in\{1,\dots,L\} \qquad (1)$$
where $l(\xi_{s})=\frac{1}{\eta_{1}}\ln\cosh(\eta_{2}\xi_{s})$ is the loss function, $C$ is the penalty parameter, $\mathbf{x}_{s}\in\mathbb{R}^{m}$ denotes a training sample, $y_{s}\in\mathbb{R}$ is the desired output for $s\in\{1,\dots,L\}$, $\varphi(\cdot):\mathbb{R}^{m}\to\mathbb{R}^{n}$ is a nonlinear mapping and $\varepsilon$ determines the insensitivity region. The Lagrangian of this problem (1) is then obtained as follows:
$$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\xi}^{*},\boldsymbol{\lambda},\boldsymbol{\lambda}^{*})=\frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{s=1}^{L}\big(l(\xi_{s})+l(\xi_{s}^{*})\big)-\sum_{s=1}^{L}\lambda_{s}\big(\varepsilon+\xi_{s}-y_{s}+\mathbf{w}^{T}\varphi(\mathbf{x}_{s})+b\big)-\sum_{s=1}^{L}\lambda_{s}^{*}\big(\varepsilon+\xi_{s}^{*}+y_{s}-\mathbf{w}^{T}\varphi(\mathbf{x}_{s})-b\big)\quad\text{subject to}\quad \lambda_{s}\ge 0,\ \lambda_{s}^{*}\ge 0,\ s\in\{1,\dots,L\} \qquad (2)$$
and the optimality conditions become as follows.
$$\nabla_{b}L=0\;\Rightarrow\;\sum_{s=1}^{L}(\lambda_{s}-\lambda_{s}^{*})=0 \qquad (3)$$
$$\nabla_{\mathbf{w}}L=0\;\Rightarrow\;\mathbf{w}=\sum_{s=1}^{L}(\lambda_{s}-\lambda_{s}^{*})\,\varphi(\mathbf{x}_{s}) \qquad (4)$$
$$\nabla_{\xi_{s}}L=0\;\Rightarrow\;\lambda_{s}-C\frac{\eta_{2}}{\eta_{1}}\tanh(\eta_{2}\xi_{s})=0\;\Rightarrow\;\xi_{s}=\frac{1}{\eta_{2}}\tanh^{-1}\!\Big(\frac{\eta_{1}\lambda_{s}}{\eta_{2}C}\Big),\quad -\frac{\eta_{2}}{\eta_{1}}C\le\lambda_{s}\le\frac{\eta_{2}}{\eta_{1}}C \qquad (5)$$
$$\nabla_{\xi_{s}^{*}}L=0\;\Rightarrow\;\lambda_{s}^{*}-C\frac{\eta_{2}}{\eta_{1}}\tanh(\eta_{2}\xi_{s}^{*})=0\;\Rightarrow\;\xi_{s}^{*}=\frac{1}{\eta_{2}}\tanh^{-1}\!\Big(\frac{\eta_{1}\lambda_{s}^{*}}{\eta_{2}C}\Big),\quad -\frac{\eta_{2}}{\eta_{1}}C\le\lambda_{s}^{*}\le\frac{\eta_{2}}{\eta_{1}}C \qquad (6)$$
Substituting (3)–(6) into (2), the following dual smooth optimization problem is obtained as follows:
$$\begin{aligned}\min_{\boldsymbol{\lambda},\boldsymbol{\lambda}^{*}\in\mathbb{R}^{L}}\ &\frac{1}{2}\sum_{s=1}^{L}\sum_{r=1}^{L}(\lambda_{s}-\lambda_{s}^{*})K(\mathbf{x}_{s},\mathbf{x}_{r})(\lambda_{r}-\lambda_{r}^{*})+\varepsilon\sum_{s=1}^{L}(\lambda_{s}+\lambda_{s}^{*})-\sum_{s=1}^{L}y_{s}(\lambda_{s}-\lambda_{s}^{*})\\&-\sum_{s=1}^{L}\Big[\frac{C}{\eta_{1}}\ln\cosh\Big(\tanh^{-1}\Big(\frac{\eta_{1}\lambda_{s}}{\eta_{2}C}\Big)\Big)-\frac{\lambda_{s}}{\eta_{2}}\tanh^{-1}\Big(\frac{\eta_{1}\lambda_{s}}{\eta_{2}C}\Big)+\frac{C}{\eta_{1}}\ln\cosh\Big(\tanh^{-1}\Big(\frac{\eta_{1}\lambda_{s}^{*}}{\eta_{2}C}\Big)\Big)-\frac{\lambda_{s}^{*}}{\eta_{2}}\tanh^{-1}\Big(\frac{\eta_{1}\lambda_{s}^{*}}{\eta_{2}C}\Big)\Big]\\ \text{subject to}\ &\sum_{s=1}^{L}(\lambda_{s}-\lambda_{s}^{*})=0,\quad 0\le\lambda_{s},\lambda_{s}^{*}\le\frac{\eta_{2}}{\eta_{1}}C,\quad s\in\{1,\dots,L\}\end{aligned} \qquad (7)$$
where $K(\mathbf{x}_{s},\mathbf{x}_{r})=\varphi(\mathbf{x}_{s})^{T}\varphi(\mathbf{x}_{r})$ is the kernel function and $\boldsymbol{\lambda}=[\lambda_{1}\ \cdots\ \lambda_{L}]^{T}\in\mathbb{R}^{L}$, $\boldsymbol{\lambda}^{*}=[\lambda_{1}^{*}\ \cdots\ \lambda_{L}^{*}]^{T}\in\mathbb{R}^{L}$ are the Lagrange multipliers.

2.2. The Nonsmooth Version of ε-ln SVR

The primal optimization problem of ε-ln SVR in (1) can be equivalently formulated with equality constraints as follows.
$$\min_{\mathbf{w}\in\mathbb{R}^{n},\,b\in\mathbb{R}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{s=1}^{L}l_{\varepsilon}(\xi_{s})\quad\text{subject to}\quad \xi_{s}=y_{s}-\mathbf{w}^{T}\varphi(\mathbf{x}_{s})-b,\quad s\in\{1,\dots,L\} \qquad (8)$$
The continuously differentiable ε-insensitive loss function is defined as follows:
$$l_{\varepsilon}(x;\eta_{1},\eta_{2})=\begin{cases}0, & \text{if } |x|<\varepsilon\\ \frac{1}{\eta_{1}}\ln\cosh\big(\eta_{2}(|x|-\varepsilon)\big), & \text{otherwise}\end{cases} \qquad (9)$$
where the penalty parameter is represented by $C$ and the insensitivity region is determined by $\varepsilon$. The Lagrangian of this problem (8) is then obtained as follows:
$$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha})=\frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{s=1}^{L}l_{\varepsilon}(\xi_{s})-\sum_{s=1}^{L}\alpha_{s}\big(\xi_{s}-y_{s}+\mathbf{w}^{T}\varphi(\mathbf{x}_{s})+b\big) \qquad (10)$$
and the optimality conditions become as follows:
$$\nabla_{b}L=0\;\Rightarrow\;\sum_{s=1}^{L}\alpha_{s}=0 \qquad (11)$$
$$\nabla_{\mathbf{w}}L=\mathbf{w}-\sum_{s=1}^{L}\alpha_{s}\varphi(\mathbf{x}_{s})=0 \qquad (12)$$
$$\nabla_{\xi_{s}}L=0\;\Rightarrow\;\alpha_{s}-C\frac{\eta_{2}}{\eta_{1}}\tanh_{\varepsilon}(\xi_{s};\eta_{2})=0 \qquad (13)$$
where
$$\tanh_{\varepsilon}(x;\eta_{2})\overset{\text{def}}{=}\begin{cases}0, & \text{if } |x|<\varepsilon\\ \tanh\big(\eta_{2}(x-\varepsilon)\big), & \text{if } x\ge\varepsilon\\ \tanh\big(\eta_{2}(x+\varepsilon)\big), & \text{otherwise.}\end{cases}$$
Equation (13) implies $\xi_{s}=\frac{1}{\eta_{2}}\tanh^{-1}\!\big(\frac{\eta_{1}\alpha_{s}}{\eta_{2}C}\big)+\varepsilon\,\mathrm{sign}^{*}(\alpha_{s})$, where $\mathrm{sign}^{*}(\alpha_{s})$ is defined as $1$ if $\alpha_{s}>0$, $[-1,1]$ if $\alpha_{s}=0$ and $-1$ if $\alpha_{s}<0$. Substituting this together with (11) and (12) into (10), the following dual nonsmooth optimization problem is obtained:
$$\begin{aligned}\min_{\boldsymbol{\alpha}\in\mathbb{R}^{L}}\ J(\boldsymbol{\alpha})=\ &\frac{1}{2}\sum_{s=1}^{L}\sum_{r=1}^{L}\alpha_{s}K(\mathbf{x}_{s},\mathbf{x}_{r})\alpha_{r}-\sum_{s=1}^{L}y_{s}\alpha_{s}-\sum_{s=1}^{L}\Big[\frac{C}{\eta_{1}}\ln\cosh\Big(\tanh^{-1}\Big(\frac{\eta_{1}\alpha_{s}}{\eta_{2}C}\Big)\Big)-\frac{\alpha_{s}}{\eta_{2}}\tanh^{-1}\Big(\frac{\eta_{1}\alpha_{s}}{\eta_{2}C}\Big)\Big]+\varepsilon\sum_{s=1}^{L}|\alpha_{s}|\\ \text{subject to}\ &\sum_{s=1}^{L}\alpha_{s}=0,\quad -\frac{\eta_{2}}{\eta_{1}}C\le\alpha_{s}\le\frac{\eta_{2}}{\eta_{1}}C,\quad s\in\{1,\dots,L\}\end{aligned} \qquad (14)$$
where $K(\mathbf{x}_{s},\mathbf{x}_{r})=\varphi(\mathbf{x}_{s})^{T}\varphi(\mathbf{x}_{r})$ is the kernel function and $\boldsymbol{\alpha}=[\alpha_{1}\ \cdots\ \alpha_{L}]^{T}\in\mathbb{R}^{L}$ are the Lagrange multipliers.
It is worth mentioning that the loss function in (9) is optimal in the maximum likelihood sense for a family of PHS distributions, as described in [49]. This family of distributions has the following probability density function:
$$p(x;\eta_{1},\eta_{2},\varepsilon)=\frac{\eta_{2}}{B\big(\frac{1}{2\eta_{1}},\frac{1}{2}\big)+2\varepsilon\eta_{2}}\,\big[\mathrm{sech}_{\varepsilon}(x;\eta_{2})\big]^{\frac{1}{\eta_{1}}} \qquad (15)$$
where $B(k_{1},k_{2})=\int_{0}^{1}t^{k_{1}-1}(1-t)^{k_{2}-1}\,dt$ is the Beta function, $\mathrm{sech}_{\varepsilon}(x;\eta_{2})\overset{\text{def}}{=}\begin{cases}1, & \text{if } |x|<\varepsilon\\ \mathrm{sech}\big(\eta_{2}(|x|-\varepsilon)\big), & \text{otherwise}\end{cases}$ and $\eta_{1}>0$.
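As a quick numerical illustration (not part of the original derivation), the density (15) can be evaluated with standard library routines, computing the Beta function through the log-gamma function as $B(k_{1},k_{2})=\exp\big(\ln\Gamma(k_{1})+\ln\Gamma(k_{2})-\ln\Gamma(k_{1}+k_{2})\big)$; the C++ function names below are our own.

```cpp
#include <cmath>
#include <cstdio>

// Beta function via log-gamma: B(k1,k2) = Gamma(k1)*Gamma(k2)/Gamma(k1+k2).
double beta_fn(double k1, double k2) {
    return std::exp(std::lgamma(k1) + std::lgamma(k2) - std::lgamma(k1 + k2));
}

// sech_eps(x; eta2): 1 inside the eps-insensitive zone, sech(eta2*(|x|-eps)) outside.
double sech_eps(double x, double eta2, double eps) {
    double a = std::fabs(x) - eps;
    return (a < 0.0) ? 1.0 : 1.0 / std::cosh(eta2 * a);
}

// PHS probability density (15).
double phs_pdf(double x, double eta1, double eta2, double eps) {
    double norm = beta_fn(0.5 / eta1, 0.5) + 2.0 * eps * eta2;
    return (eta2 / norm) * std::pow(sech_eps(x, eta2, eps), 1.0 / eta1);
}

int main() {
    // eta1 = 1, eta2 = pi/2, eps = 0 reproduces the hyperbolic secant density (18) with sigma = 1.
    const double pi = std::acos(-1.0);
    for (double x = -2.0; x <= 2.0; x += 1.0)
        std::printf("p(%+.1f) = %.6f\n", x, phs_pdf(x, 1.0, pi / 2.0, 0.0));
    return 0;
}
```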
The study in [49] leads to the following proposition, which highlights the impact of the parameters η1 and η2: by adjusting these parameters, the probability density function (15) becomes equivalent to the Laplacian, Gaussian and hyperbolic secant distributions.
Proposition 1.
The distribution defined in (15) becomes equivalent to well-known distributions for certain values of the tunable parameters, such that
$$\lim_{\eta_{1}=\sigma\eta_{2}\to\infty}p(x;\eta_{1},\eta_{2},\varepsilon)=\frac{1}{2\sigma+2\varepsilon}\,e^{-\frac{|x|_{\varepsilon}}{\sigma}} \qquad (16)$$
$$\lim_{\eta_{1}=\sigma^{2}\eta_{2}^{2}\to 0}p(x;\eta_{1},\eta_{2},\varepsilon)=\frac{1}{\sqrt{2\pi\sigma^{2}}+2\varepsilon}\,e^{-\frac{(|x|_{\varepsilon})^{2}}{2\sigma^{2}}} \qquad (17)$$
$$p\Big(x;\eta_{1}=1,\eta_{2}=\frac{\pi}{2\sigma},\varepsilon\Big)=\frac{1}{2\sigma+2\varepsilon}\,\mathrm{sech}_{\varepsilon}\Big(x;\frac{\pi}{2\sigma}\Big) \qquad (18)$$
where $|x|_{\varepsilon}:=\max(|x|-\varepsilon,0)$ denotes the ε-insensitive absolute value, and (16), (17) and (18) are equivalent to the Laplace, Gaussian and hyperbolic secant distributions for $\varepsilon=0$, respectively.
Proof. 
Equation (15) can be rewritten as follows:
$$p(x;\eta_{1},\eta_{2},\varepsilon)=\frac{\eta_{2}}{B\big(\frac{1}{2\eta_{1}},\frac{1}{2}\big)+2\varepsilon\eta_{2}}\,e^{-l_{\varepsilon}(x;\eta_{1},\eta_{2})}. \qquad (19)$$
Substituting $\lim_{\eta_{1}=\sigma\eta_{2}\to\infty}\frac{\eta_{2}}{B(\frac{1}{2\eta_{1}},\frac{1}{2})+2\varepsilon\eta_{2}}=\frac{1}{2\sigma+2\varepsilon}$ and $\lim_{\eta_{1}=\sigma\eta_{2}\to\infty}l_{\varepsilon}(x;\eta_{1},\eta_{2})=\frac{|x|_{\varepsilon}}{\sigma}$ into (19), it is obvious that (16) holds. Employing Stirling's approximation of the Beta function, it follows that $B\big(\tfrac{1}{2\eta_{1}},\tfrac{1}{2}\big)\to\sqrt{2\pi\eta_{1}}$ as $\eta_{1}\to 0$. Substituting this into (19) and considering $\lim_{\eta_{1}=\sigma^{2}\eta_{2}^{2}\to 0}l_{\varepsilon}(x;\eta_{1},\eta_{2})=\frac{(|x|_{\varepsilon})^{2}}{2\sigma^{2}}$, it is obvious that (17) holds. Equation (18) is obtained by substituting $\eta_{1}=1$ and $\eta_{2}=\frac{\pi}{2\sigma}$ into (15).    □
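For completeness, the Stirling step used above can be spelled out (a routine expansion, not an addition to the original argument):
$$B\!\Big(\tfrac{1}{2\eta_{1}},\tfrac{1}{2}\Big)=\frac{\Gamma\big(\tfrac{1}{2\eta_{1}}\big)\,\Gamma\big(\tfrac{1}{2}\big)}{\Gamma\big(\tfrac{1}{2\eta_{1}}+\tfrac{1}{2}\big)}=\sqrt{\pi}\,\frac{\Gamma\big(\tfrac{1}{2\eta_{1}}\big)}{\Gamma\big(\tfrac{1}{2\eta_{1}}+\tfrac{1}{2}\big)}\;\sim\;\sqrt{\pi}\,\Big(\tfrac{1}{2\eta_{1}}\Big)^{-1/2}=\sqrt{2\pi\eta_{1}},\qquad \eta_{1}\to 0,$$
using $\Gamma(z+\tfrac{1}{2})/\Gamma(z)\sim z^{1/2}$ as $z\to\infty$.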
Without loss of generality, $\eta_{2}$ can be chosen as $\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}$, which results in the standardized PHS density functions described in [49], in order to take advantage of tuning only one parameter. Benefiting from this choice, instead of tuning the two parameters $\eta_{1}$ and $\eta_{2}$, only $\eta_{1}$ needs to be tuned, and the loss continues to cover Vapnik's and the ε-insensitive l2 losses in the limit, such that
$$\lim_{\eta_{1}\to\infty}l_{\varepsilon}\Big(x;\eta_{1},\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}\Big)=\sqrt{2}\,|x|_{\varepsilon} \qquad (20)$$
and
$$\lim_{\eta_{1}\to 0}l_{\varepsilon}\Big(x;\eta_{1},\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}\Big)=\frac{(|x|_{\varepsilon})^{2}}{2} \qquad (21)$$
Debruyne et al. [50] state that, if the kernel function is bounded and the first derivative of the loss function is bounded, then the influence function is also bounded. This makes the lncosh loss function (9) attractive for building robust estimators. The derivative of the lncosh loss function is bounded by $\frac{\eta_{2}}{\eta_{1}}$, and $\eta_{2}$ is chosen as $\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}$, where $\psi_{1}(\cdot)$ is the trigamma function. Since, for $x>0$, it is known that $\frac{1}{x}+\frac{1}{2x^{2}}<\psi_{1}(x)<\frac{1}{x}+\frac{1}{x^{2}}$, this bound becomes $\frac{\eta_{2}}{\eta_{1}}<\sqrt{2+\frac{1}{\eta_{1}}}$. This reveals how the parameter $\eta_{1}$ controls robustness: a small $\eta_{1}$ can lead to a large influence function, making the estimator more susceptible to outliers, whereas a large $\eta_{1}$ keeps the influence function small, leading to robustness. A bounded influence function signifies that there is a limit to how much a single outlier can affect the overall estimate derived by the model, a characteristic directly linked to robustness. Since the lncosh loss function has a bounded influence function, the model is less susceptible to the negative effects of outliers, contributing to its overall robustness. In addition, the single-parameter lncosh loss function is demonstrably optimal in the maximum likelihood sense for a broad family of noise distributions, including Laplace, Gaussian and hyperbolic secant, as shown in Figure 1e. This adjustable design makes it well-suited for practical applications where the noise distribution is unknown. With this choice of $\eta_{2}$, the loss functions (9) for various values of $\eta_{1}$ are illustrated in Figure 1, along with their first derivatives, which are related to the influence function, and the corresponding probability density functions.
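A minimal numerical sketch of this single-parameter loss is given below; it is illustrative code of ours, not the paper's implementation. It evaluates $\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}(\tfrac{1}{2\eta_{1}})}$ with a hand-rolled trigamma routine (recurrence plus asymptotic series), the loss (9), its derivative, and the bound $\sqrt{2+1/\eta_{1}}$ discussed above.

```cpp
#include <cmath>
#include <cstdio>

// Trigamma psi_1(x) via the recurrence psi_1(x) = psi_1(x+1) + 1/x^2 and the asymptotic series
// psi_1(x) ~ 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) for large x.
double trigamma(double x) {
    double acc = 0.0;
    while (x < 8.0) { acc += 1.0 / (x * x); x += 1.0; }
    double r = 1.0 / x;
    return acc + r + r * r * (0.5 + r * (1.0 / 6.0 - r * r / 30.0));
}

// Standardizing choice eta2 = sqrt(psi_1(1/(2*eta1)) / 2).
double eta2_of(double eta1) { return std::sqrt(0.5 * trigamma(0.5 / eta1)); }

// eps-insensitive lncosh loss (9) and its derivative.
double loss(double x, double eta1, double eta2, double eps) {
    double a = std::fabs(x) - eps;
    return (a < 0.0) ? 0.0 : std::log(std::cosh(eta2 * a)) / eta1;
}
double dloss(double x, double eta1, double eta2, double eps) {
    double a = std::fabs(x) - eps;
    if (a < 0.0) return 0.0;
    return ((x > 0.0) ? 1.0 : -1.0) * (eta2 / eta1) * std::tanh(eta2 * a);
}

int main() {
    const double etas[] = {0.125, 1.0, 8.0};   // small eta1 ~ l2-like, large eta1 ~ l1-like
    for (double eta1 : etas) {
        double eta2 = eta2_of(eta1);
        std::printf("eta1=%6.3f  eta2=%6.3f  |l'| <= %6.3f  bound sqrt(2+1/eta1)=%6.3f  l(1)=%6.3f  l'(1)=%6.3f\n",
                    eta1, eta2, eta2 / eta1, std::sqrt(2.0 + 1.0 / eta1),
                    loss(1.0, eta1, eta2, 0.0), dloss(1.0, eta1, eta2, 0.0));
    }
    return 0;
}
```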
While halving the number of optimization parameters compared to its smooth counterpart (7) is a compelling advantage of the nonsmooth dual problem (14), dealing with its nonsmoothness and nonlinearity is a significant challenge. Directly solving this problem using conventional methods can be computationally expensive, potentially hindering real-world applications. To address this challenge, the next section introduces a novel SMO-like algorithm with a computationally efficient WSS procedure specifically designed to navigate the complexities of the nonsmooth nonlinear problem and unlock its efficiency potential.

3. The SMO-like Algorithm for the Nonsmooth Dual Problem of ε-ln SVR

Originally developed to solve QP problems in SVM, SMO is an iterative algorithm that updates only two Lagrange multipliers at each step, ensuring convergence to the optimal solution. It was later extended to solve piecewise QP problems in SVR [18,19,20,21,22]. In this study, the SMO algorithm is further extended to efficiently solve the more complex nonsmooth nonlinear SVR dual problem (14). The proposed approach achieves its efficiency through several techniques. First, an easy-to-compute WSS procedure is introduced that utilizes the concept of a Taylor series approximation and its upper bound to provide SO-like information. Second, the nonsmooth problem with half the optimization variables of the classical SVR is derived by employing the subdifferential approach. Finally, the nonsmooth decomposed problem is transformed into a root-finding problem, which is efficiently solved by utilizing Brent's method. This section describes the proposed SMO algorithm in detail.
The following matrix representation of the convex nonsmooth optimization problem (14) is considered to derive the SMO algorithm.
$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^{L}}\ J(\boldsymbol{\alpha})=\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}+\mathbf{p}^{T}\boldsymbol{\alpha}+\sum_{s=1}^{L}T(\alpha_{s})+\varepsilon\|\boldsymbol{\alpha}\|_{1}\quad\text{subject to}\quad \mathbf{u}^{T}\boldsymbol{\alpha}=0,\quad -\hat{C}\le\alpha_{s}\le\hat{C},\ s\in\{1,\dots,L\} \qquad (22)$$
Here, $T(\alpha_{s})=-\frac{\hat{C}}{\eta_{2}}\ln\cosh\big(\tanh^{-1}\big(\frac{\alpha_{s}}{\hat{C}}\big)\big)+\frac{\alpha_{s}}{\eta_{2}}\tanh^{-1}\big(\frac{\alpha_{s}}{\hat{C}}\big)$ is a nonlinear function with $\hat{C}=\frac{\eta_{2}}{\eta_{1}}C$. The optimization variables are denoted by $\boldsymbol{\alpha}=[\alpha_{1}\ \alpha_{2}\ \cdots\ \alpha_{L}]^{T}$, and $\mathbf{K}\in\mathbb{R}^{L\times L}$ is the kernel matrix with elements $K_{sr}=K(\mathbf{x}_{s},\mathbf{x}_{r})$. $\mathbf{u}=[1\ 1\ \cdots\ 1]^{T}\in\mathbb{R}^{L}$ is the all-ones vector and $\mathbf{y}=[y_{1}\ y_{2}\ \cdots\ y_{L}]^{T}\in\mathbb{R}^{L}$ represents the desired outputs. Additionally, the linear term is given by $\mathbf{p}=-\mathbf{y}$.
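The nonlinear term $T(\alpha_{s})$ and its derivative $T'(\alpha_{s})=\frac{1}{\eta_{2}}\tanh^{-1}(\alpha_{s}/\hat{C})$, which the gradient computations in the sequel rely on, can be coded directly. The following is an illustrative helper of ours (with Chat standing for $\hat{C}$), not part of the reference implementation.

```cpp
#include <cmath>
#include <cstdio>

// T(alpha) = -(Chat/eta2) * ln cosh(atanh(alpha/Chat)) + (alpha/eta2) * atanh(alpha/Chat), |alpha| < Chat.
double T_of(double alpha, double Chat, double eta2) {
    double u = std::atanh(alpha / Chat);
    return (-Chat * std::log(std::cosh(u)) + alpha * u) / eta2;
}

// Its derivative, T'(alpha) = (1/eta2) * atanh(alpha/Chat), which is increasing,
// so T is convex on (-Chat, Chat) and blows up toward the box boundary.
double Tprime_of(double alpha, double Chat, double eta2) {
    return std::atanh(alpha / Chat) / eta2;
}

int main() {
    std::printf("T(0.5) = %.6f, T'(0.5) = %.6f\n", T_of(0.5, 2.0, 1.0), Tprime_of(0.5, 2.0, 1.0));
    return 0;
}
```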
The following subsections provide a comprehensive description of the decomposition of the problem (22), the solution of the resulting decomposed problem, the determination of the stopping criterion and the selection of the working set in the proposed SMO-like algorithm.

3.1. Decomposition and Solution Based on Brent’s Method

The SMO algorithm updates only two variables in each iteration, $\alpha_{i}^{k+1}=\alpha_{i}^{k}-d_{j}$ and $\alpha_{j}^{k+1}=\alpha_{j}^{k}+d_{j}$, so as to satisfy the equality constraint in (22). Here $\alpha_{i}^{k}$ and $\alpha_{j}^{k}$ represent the constant old parameter values, while $d_{j}$ denotes the update length. By substituting $\alpha_{i}=\alpha_{i}^{k}-d_{j}$ and $\alpha_{j}=\alpha_{j}^{k}+d_{j}$ into (22) and discarding the constant terms, the decomposed problem for the single variable $d_{j}$ is as follows.
$$\min_{d_{j}\in\mathbb{R}}\ \hat{J}(d_{j})=\frac{1}{2}a_{ij}d_{j}^{2}+b_{ij}d_{j}+T(\alpha_{i}^{k}-d_{j})+T(\alpha_{j}^{k}+d_{j})+\varepsilon|\alpha_{i}^{k}-d_{j}|+\varepsilon|\alpha_{j}^{k}+d_{j}|\quad\text{subject to}\quad lb\le d_{j}\le ub \qquad (23)$$
where $lb=\max(-\hat{C}-\alpha_{j}^{k},\,-\hat{C}+\alpha_{i}^{k})$ and $ub=\min(\hat{C}-\alpha_{j}^{k},\,\hat{C}+\alpha_{i}^{k})$, $a_{ij}=K_{ii}+K_{jj}-2K_{ij}>0$, and $b_{ij}=\nabla f_{q}(\boldsymbol{\alpha}^{k})_{j}-\nabla f_{q}(\boldsymbol{\alpha}^{k})_{i}$ with $f_{q}(\boldsymbol{\alpha}^{k})=\frac{1}{2}\boldsymbol{\alpha}^{kT}\mathbf{K}\boldsymbol{\alpha}^{k}+\mathbf{p}^{T}\boldsymbol{\alpha}^{k}$ being the quadratic part of the loss (22).
Definition 1
(Violating pair). If $i,j\in\{1,\dots,L\}$ and $-\nabla f(\boldsymbol{\alpha})_{i}-\varepsilon\,\mathrm{sign}^{+}(\alpha_{i})>-\nabla f(\boldsymbol{\alpha})_{j}-\varepsilon\,\mathrm{sign}^{-}(\alpha_{j})$, then $\{i,j\}$ is a "violating pair", where $f(\boldsymbol{\alpha})=\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}+\mathbf{p}^{T}\boldsymbol{\alpha}+\sum_{s=1}^{L}T(\alpha_{s})$ is the smooth part of the loss (22), $\mathrm{sign}^{+}(\alpha_{i})\overset{\text{def}}{=}\begin{cases}1, & \text{if }\alpha_{i}\ge 0\\ -1, & \text{if }\alpha_{i}<0\end{cases}$ and $\mathrm{sign}^{-}(\alpha_{i})\overset{\text{def}}{=}\begin{cases}1, & \text{if }\alpha_{i}>0\\ -1, & \text{if }\alpha_{i}\le 0.\end{cases}$
Proposition 2.
If $\{i,j\}$ is a violating pair, the global optimum of problem (23) satisfies $d_{j}^{*}<0$.
Proof. 
The subdifferential of problem (23) is obtained as
$$\partial\hat{J}(d_{j})=\begin{cases}\big[\hat{f}'(m_{1})-2\varepsilon,\ \hat{f}'(m_{1})\big], & \text{if } d_{j}=m_{1}\\ \big[\hat{f}'(m_{2}),\ \hat{f}'(m_{2})+2\varepsilon\big], & \text{if } d_{j}=m_{2}\\ \hat{J}'_{+}(d_{j}), & \text{otherwise}\end{cases} \qquad (24)$$
where $\hat{J}'_{+}(d_{j})=\hat{f}'(d_{j})+\varepsilon\,\mathrm{sign}^{+}(\alpha_{j}^{k}+d_{j})-\varepsilon\,\mathrm{sign}^{-}(\alpha_{i}^{k}-d_{j})$ is the right-hand-side derivative of the loss function (23), $\hat{f}(d_{j})=\frac{1}{2}a_{ij}d_{j}^{2}+b_{ij}d_{j}+T(\alpha_{i}^{k}-d_{j})+T(\alpha_{j}^{k}+d_{j})$ is the smooth part of the loss (23), and $m_{1}=\min(\alpha_{i}^{k},-\alpha_{j}^{k})$ and $m_{2}=\max(\alpha_{i}^{k},-\alpha_{j}^{k})$ are the break points of this loss function.
For the first case of (24), which applies when $0=m_{1}\ne m_{2}$, $\partial\hat{J}(0)\subset(0,\infty)$, since the pair $\{i,j\}$ is a violating pair and satisfies $\hat{f}'(0)=\nabla f(\boldsymbol{\alpha})_{j}-\nabla f(\boldsymbol{\alpha})_{i}>\varepsilon\,\mathrm{sign}^{+}(\alpha_{i})-\varepsilon\,\mathrm{sign}^{-}(\alpha_{j})=2\varepsilon$. Similarly, for the second case of (24), which applies when $0=m_{2}\ne m_{1}$, $\partial\hat{J}(0)\subset(0,\infty)$ since $\hat{f}'(0)>0$. For the last case of (24), $\hat{J}'_{+}(0)=\hat{f}'(0)+\varepsilon\,\mathrm{sign}^{+}(\alpha_{j}^{k})-\varepsilon\,\mathrm{sign}^{-}(\alpha_{i}^{k})\ge\nabla f(\boldsymbol{\alpha})_{j}-\nabla f(\boldsymbol{\alpha})_{i}+\varepsilon\,\mathrm{sign}^{-}(\alpha_{j})-\varepsilon\,\mathrm{sign}^{+}(\alpha_{i})>0$ for $0\ne m_{1}$ and $0\ne m_{2}$, since the pair $\{i,j\}$ is a violating pair. Therefore, $d_{j}^{*}<0$, since all subgradients at $d_{j}=0$ are positive and the problem in (23) is strictly convex.    □
Proposition 3.
If $\{i,j\}$ is a violating pair and $a_{ij}=K_{ii}+K_{jj}-2K_{ij}>0$, problem (23) has a global optimum defined as follows:
$$d_{j}^{*}=\begin{cases}m_{1}, & \text{if } 0\le\hat{f}'(m_{1})\le 2\varepsilon \text{ and } lb\le m_{1}<0\\ m_{2}, & \text{if } -2\varepsilon\le\hat{f}'(m_{2})\le 0 \text{ and } lb\le m_{2}<0\\ \{d_{j}\mid\hat{J}'_{+}(d_{j})=0,\ lb\le d_{j}<0\}, & \text{otherwise}\end{cases} \qquad (25)$$
where $\hat{J}'_{+}(d_{j})=\hat{f}'(d_{j})+\varepsilon\,\mathrm{sign}^{+}(\alpha_{j}^{k}+d_{j})-\varepsilon\,\mathrm{sign}^{-}(\alpha_{i}^{k}-d_{j})$ is the right-hand-side derivative of the loss function (23), and $m_{1}=\min(\alpha_{i}^{k},-\alpha_{j}^{k})$ and $m_{2}=\max(\alpha_{i}^{k},-\alpha_{j}^{k})$ are the break points of this loss function.
Proof. 
The optimization problem (23) is nondifferentiable but strictly convex, ensuring the existence of a unique optimal solution due to the condition $a_{ij}=K_{ii}+K_{jj}-2K_{ij}>0$. This strict convexity, together with $d_{j}^{*}<0$ established in Proposition 2, implies $\partial\hat{J}(0)\subset(0,\infty)$ and $\partial\hat{J}(lb)\subset(-\infty,0)$. Therefore, the optimality condition satisfies $0\in\partial\hat{J}(d_{j}^{*})$ together with $lb<d_{j}^{*}<0$. So, the global optimum (25) is obtained from the subdifferential defined in (24) together with $0\in\partial\hat{J}(d_{j}^{*})$ and $lb<d_{j}^{*}<0$.    □
Consequently, the values of the two Lagrange multipliers can be updated as follows:
$$\alpha_{j}^{k+1}=\alpha_{j}^{k}+d_{j}^{*} \qquad (26)$$
and
$$\alpha_{i}^{k+1}=\alpha_{i}^{k}-d_{j}^{*} \qquad (27)$$
where the optimal $d_{j}^{*}$ is determined by (25), and the root-finding case $\{d_{j}\mid\hat{J}'_{+}(d_{j})=0,\ lb\le d_{j}<0\}$ in (25) is solved using the well-known Brent's method. The SMO algorithm continues to update two elements of the vector $\boldsymbol{\alpha}$ until the stopping criterion is satisfied, as described in the following subsection.
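To make the update concrete, the sketch below (our illustrative code, not the paper's C++ implementation) assembles the subproblem data and solves $\hat{J}'_{+}(d_{j})=0$ on $(lb,0)$ by bisection on the nondecreasing right-hand derivative. This is a simple stand-in for Brent's method that also lands on the break points of (25) correctly, because the sign change of the derivative identifies the minimizer even at a kink.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

static double sign_plus (double x) { return x >= 0.0 ? 1.0 : -1.0; }   // sign^+
static double sign_minus(double x) { return x >  0.0 ? 1.0 : -1.0; }   // sign^-
static double Tprime(double a, double Chat, double eta2) { return std::atanh(a / Chat) / eta2; }

// Right-hand derivative of the decomposed objective (23):
//   Jhat'_+(d) = a_ij*d + b_ij - T'(alpha_i - d) + T'(alpha_j + d)
//                + eps*sign^+(alpha_j + d) - eps*sign^-(alpha_i - d).
static double Jhat_rderiv(double d, double a_ij, double b_ij, double ai, double aj,
                          double Chat, double eta2, double eps) {
    return a_ij * d + b_ij - Tprime(ai - d, Chat, eta2) + Tprime(aj + d, Chat, eta2)
         + eps * sign_plus(aj + d) - eps * sign_minus(ai - d);
}

// For a violating pair {i,j} the optimum of (23) lies in (lb, 0): all subgradients at 0 are
// positive (Proposition 2) and T' blows up at the box boundary, so Jhat'_+ tends to -inf near lb.
// Since Jhat'_+ is nondecreasing, bisection on its sign change recovers d*; Brent's method
// performs the same task with faster convergence.
static double solve_subproblem(double a_ij, double b_ij, double ai, double aj,
                               double Chat, double eta2, double eps) {
    double lo = std::max(-Chat - aj, -Chat + ai);   // lb: keeps alpha_j + d and alpha_i - d in the box
    double hi = 0.0;
    for (int it = 0; it < 100; ++it) {
        double mid = 0.5 * (lo + hi);
        if (Jhat_rderiv(mid, a_ij, b_ij, ai, aj, Chat, eta2, eps) < 0.0) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main() {
    // Toy subproblem: K_ii = K_jj = 1, K_ij = 0.2 (a_ij = 1.6), b_ij = 1.5, alpha_i = alpha_j = 0.
    double d = solve_subproblem(1.6, 1.5, 0.0, 0.0, /*Chat=*/10.0, /*eta2=*/1.0, /*eps=*/0.1);
    std::printf("d* = %.6f\n", d);   // approximately -0.72 for these toy numbers
    return 0;
}
```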

3.2. Stopping Criterion

The Lagrangian of the problem (22) is as follows.
$$\mathcal{L}(\boldsymbol{\alpha},b,\boldsymbol{\mu},\boldsymbol{\kappa})=\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}+\mathbf{p}^{T}\boldsymbol{\alpha}+\sum_{s=1}^{L}T(\alpha_{s})+\varepsilon\|\boldsymbol{\alpha}\|_{1}+b\,\mathbf{u}^{T}\boldsymbol{\alpha}-\sum_{s=1}^{L}\mu_{s}(\hat{C}-\alpha_{s})-\sum_{s=1}^{L}\kappa_{s}(\hat{C}+\alpha_{s}). \qquad (28)$$
The optimality conditions of (28), known as the Karush–Kuhn–Tucker (KKT) conditions, are presented as follows:
$$0\in\partial_{\alpha_{s}}\mathcal{L}, \qquad (29)$$
$$\mathbf{u}^{T}\boldsymbol{\alpha}=0,\quad -\hat{C}\le\alpha_{s}\le\hat{C}, \qquad (30)$$
$$\mu_{s}(\hat{C}-\alpha_{s})=0,\quad \kappa_{s}(\hat{C}+\alpha_{s})=0, \qquad (31)$$
$$\mu_{s}\ge 0,\quad \kappa_{s}\ge 0,\quad s\in\{1,\dots,L\} \qquad (32)$$
where
$$\partial_{\alpha_{s}}\mathcal{L}=\begin{cases}\nabla f(\boldsymbol{\alpha})_{s}+\varepsilon+b+\mu_{s}-\kappa_{s}, & \text{if }\alpha_{s}>0\\ \nabla f(\boldsymbol{\alpha})_{s}-\varepsilon+b+\mu_{s}-\kappa_{s}, & \text{if }\alpha_{s}<0\\ \big[\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon+b+\mu_{s}-\kappa_{s},\ \nabla f(\boldsymbol{\alpha})_{s}+\varepsilon+b+\mu_{s}-\kappa_{s}\big], & \text{if }\alpha_{s}=0\end{cases}$$
denotes the subdifferential of (28). The update of the optimization variables in (26) and (27) already accounts for the equality condition in (30). The remaining KKT optimality conditions (29)–(32) can be reformulated as follows.
$$\begin{cases}b=-\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon-\mu_{s}+\kappa_{s}, & \text{if }\alpha_{s}>0\\ b=-\nabla f(\boldsymbol{\alpha})_{s}+\varepsilon-\mu_{s}+\kappa_{s}, & \text{if }\alpha_{s}<0\\ -\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon-\mu_{s}+\kappa_{s}\le b\le-\nabla f(\boldsymbol{\alpha})_{s}+\varepsilon-\mu_{s}+\kappa_{s}, & \text{if }\alpha_{s}=0\end{cases} \qquad (33)$$
Considering that $\alpha_{s}<\hat{C}\Rightarrow\mu_{s}=0,\ \kappa_{s}\ge 0$ and $\alpha_{s}>-\hat{C}\Rightarrow\kappa_{s}=0,\ \mu_{s}\ge 0$, the above conditions (33) are satisfied if and only if
$$m(\boldsymbol{\alpha})\le M(\boldsymbol{\alpha}) \qquad (34)$$
where $m(\boldsymbol{\alpha})=\max_{s\in\{s\mid\alpha_{s}<\hat{C}\}}g_{s}(\boldsymbol{\alpha})$ and $M(\boldsymbol{\alpha})=\min_{s\in\{s\mid\alpha_{s}>-\hat{C}\}}G_{s}(\boldsymbol{\alpha})$ with $g_{s}(\boldsymbol{\alpha})=-\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon\,\mathrm{sign}^{+}(\alpha_{s})$ and $G_{s}(\boldsymbol{\alpha})=-\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon\,\mathrm{sign}^{-}(\alpha_{s})$. Therefore, a feasible $\boldsymbol{\alpha}$ is an optimal point of (28) if and only if it satisfies (34). For the sake of computational efficiency, a relaxed stopping criterion is defined as follows:
$$m(\boldsymbol{\alpha}^{k})-M(\boldsymbol{\alpha}^{k})\le\tau \qquad (35)$$
with τ denoting a small positive value and (35) used as the stopping criterion in the SMO-like algorithm in conjunction with the KKT conditions of the nonsmooth nonlinear optimization problem (28).
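To illustrate how (34) and (35) are checked in practice, the helper below computes $m(\boldsymbol{\alpha})$ and $M(\boldsymbol{\alpha})$ from a cached full gradient $\nabla f(\boldsymbol{\alpha})$; the plain-vector data layout and the function names are our own, not LIBSVM's.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// g_s = -grad_f[s] - eps*sign^+(alpha_s),  G_s = -grad_f[s] - eps*sign^-(alpha_s).
// Optimality (34): m(alpha) = max{g_s : alpha_s < Chat} <= M(alpha) = min{G_s : alpha_s > -Chat};
// (35) relaxes this to m - M <= tau.
bool stopping_criterion(const std::vector<double>& alpha, const std::vector<double>& grad_f,
                        double Chat, double eps, double tau) {
    double m = -std::numeric_limits<double>::infinity();
    double M =  std::numeric_limits<double>::infinity();
    for (std::size_t s = 0; s < alpha.size(); ++s) {
        double sp = (alpha[s] >= 0.0) ? 1.0 : -1.0;   // sign^+
        double sm = (alpha[s] >  0.0) ? 1.0 : -1.0;   // sign^-
        if (alpha[s] <  Chat) m = std::max(m, -grad_f[s] - eps * sp);
        if (alpha[s] > -Chat) M = std::min(M, -grad_f[s] - eps * sm);
    }
    return m - M <= tau;
}

int main() {
    std::vector<double> alpha(3), grad(3);
    alpha[0] = 0.3; alpha[1] = -0.1; alpha[2] = 0.0;
    grad[0] = -0.2; grad[1] = 0.05;  grad[2] = 0.1;   // cached grad f(alpha)_s = grad f_q(alpha)_s + T'(alpha_s)
    std::printf("stop = %d\n", stopping_criterion(alpha, grad, 1.0, 0.1, 1e-3) ? 1 : 0);
    return 0;
}
```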

3.3. Working Set Selection

The SMO algorithm solves the optimization problem by optimizing only two Lagrange multipliers in each iteration. The procedure for determining these two Lagrange multipliers, called the working set selection procedure, is the most critical part of the SMO algorithm that affects its computational cost. The WSS procedure should be both easy to compute and provide a sufficient reduction in the consecutive loss function values. To achieve this, the proposed WSS procedure relies on defining an upper bound for the second-order Taylor polynomial approximation of consecutive loss functions and selects the following easy-to-compute working set that satisfies being a violating pair.
(1)
For all $t,s$ define $a_{ts}=K_{tt}+K_{ss}-2K_{ts}>0$ and
select
$$i\in\arg\max_{t}\Big\{\,g_{t}(\boldsymbol{\alpha}^{k})\;\Big|\;t\in\{t\mid\alpha_{t}<\hat{C}\}\,\Big\} \qquad (36)$$
$$j\in\arg\min_{t}\Big\{-\frac{h_{it}^{2}}{a_{it}}\;\Big|\;t\in\{t\mid\alpha_{t}>-\hat{C}\},\ G_{t}(\boldsymbol{\alpha}^{k})<g_{i}(\boldsymbol{\alpha}^{k})\Big\} \qquad (37)$$
where $h_{it}=g_{i}(\boldsymbol{\alpha}^{k})-G_{t}(\boldsymbol{\alpha}^{k})$
(2)
Return $\{i,j\}$
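In code, the selection rule (36) and (37) amounts to one pass for $i$ and one pass for $j$. The sketch below is our own illustration: for simplicity it takes the full kernel matrix, whereas a real implementation would use a cached kernel row $\mathbf{K}_{i}$, as LIBSVM does.

```cpp
#include <limits>
#include <utility>
#include <vector>

// Working set selection (36)-(37): i maximizes g_t over {alpha_t < Chat};
// j minimizes -h_it^2/a_it over {alpha_t > -Chat, G_t < g_i}, with h_it = g_i - G_t.
std::pair<int, int> select_working_set(const std::vector<double>& alpha,
                                       const std::vector<double>& g,   // g_t(alpha^k)
                                       const std::vector<double>& G,   // G_t(alpha^k)
                                       const std::vector<std::vector<double> >& K,
                                       double Chat) {
    const double inf = std::numeric_limits<double>::infinity();
    int i = -1, j = -1;
    double gmax = -inf;
    for (std::size_t t = 0; t < alpha.size(); ++t)            // (36): maximal g_t among admissible indices
        if (alpha[t] < Chat && g[t] > gmax) { gmax = g[t]; i = static_cast<int>(t); }
    if (i < 0) return std::make_pair(-1, -1);
    double best = inf;
    for (std::size_t t = 0; t < alpha.size(); ++t) {          // (37): best SO-like bound among violating partners
        if (alpha[t] <= -Chat || G[t] >= gmax) continue;
        double h = gmax - G[t];                               // h_it > 0
        double a = K[i][i] + K[t][t] - 2.0 * K[i][t];         // a_it > 0 for a Mercer kernel
        double val = -(h * h) / a;
        if (val < best) { best = val; j = static_cast<int>(t); }
    }
    return std::make_pair(i, j);   // j == -1 means no violating pair remains: (34) holds and SMO stops
}

int main() { return 0; }   // selection logic only; wire into the SMO loop with cached g, G and kernel rows
```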
Since $d_{j}^{*}<0$ as shown in Proposition 2, the left-hand-side derivatives are employed to obtain an upper bound on the quadratic approximation, using a Taylor series expansion of the difference between consecutive loss function values around $d_{j}=0$.
$$J(\boldsymbol{\alpha}^{k+1})-J(\boldsymbol{\alpha}^{k})=\hat{J}(d_{j}^{*})-\hat{J}(0)<\hat{J}(d_{j})-\hat{J}(0)\approx\hat{J}'(0)\,d_{j}+\frac{\hat{J}''(0)}{2}\,d_{j}^{2} \qquad (38)$$
Instead of directly finding the pair { i , j } minimizing the difference of the consecutive loss functions J ( α k + 1 ) J ( α k ) , it is more computationally efficient to find the pair { i , j } minimizing the second-order Taylor polynomial approximation. The following proposition suggests an upper bound associated with minimizing this approximation, which is easy to compute.
Proposition 4.
If $\{i,j\}$ is a violating pair, there is an easily computable upper bound on the magnitude of the second-order Taylor polynomial approximation of the decrease in consecutive loss function values (38), such that
$$-\min_{d_{j}}\Big(\hat{J}'(0)\,d_{j}+\frac{\hat{J}''(0)}{2}\,d_{j}^{2}\Big)<\frac{h_{ij}^{2}}{a_{ij}} \qquad (39)$$
and the minimizer of the quadratic model in (39) satisfies $d_{j}^{*}<0$.
Proof. 
Since the Taylor polynomial approximation is quadratic, its minimum is attained at $d_{j}^{*}=-\frac{\hat{J}'(0)}{\hat{J}''(0)}$. Substituting this into (39) results in
$$\hat{J}'(0)\,d_{j}^{*}+\frac{\hat{J}''(0)}{2}\,(d_{j}^{*})^{2}=-\frac{1}{2}\,\frac{[\hat{J}'(0)]^{2}}{\hat{J}''(0)} \qquad (40)$$
where $\hat{J}'(0)=\hat{f}'(0)+\varepsilon\,\mathrm{sign}^{-}(\alpha_{j}^{k})-\varepsilon\,\mathrm{sign}^{+}(\alpha_{i}^{k})=\nabla f(\boldsymbol{\alpha})_{j}-\nabla f(\boldsymbol{\alpha})_{i}+\varepsilon\,\mathrm{sign}^{-}(\alpha_{j}^{k})-\varepsilon\,\mathrm{sign}^{+}(\alpha_{i}^{k})=g_{i}(\boldsymbol{\alpha}^{k})-G_{j}(\boldsymbol{\alpha}^{k})=h_{ij}>0$ and $\hat{J}''(0)=\hat{f}''(0)=a_{ij}+\frac{\hat{C}}{\eta_{2}[\hat{C}^{2}-(\alpha_{j}^{k})^{2}]}+\frac{\hat{C}}{\eta_{2}[\hat{C}^{2}-(\alpha_{i}^{k})^{2}]}\ge a_{ij}+\frac{2}{\eta_{2}\hat{C}}>a_{ij}>0$. Therefore, it is obtained that
$$\frac{1}{2}\,\frac{[\hat{J}'(0)]^{2}}{\hat{J}''(0)}<\frac{h_{ij}^{2}}{a_{ij}} \qquad (41)$$
since $\hat{J}'(0)=h_{ij}$ and $\hat{J}''(0)>a_{ij}$. It is obvious that $d_{j}^{*}<0$, and (40) together with (41) concludes the proof.    □
It should be noted that the proposed WSS procedure, described in (36) and (37), is computationally efficient: $i$ is chosen as one of the maximal violating indices, and $j$ is chosen as the argument that minimizes the upper bound of the second-order Taylor polynomial approximation of consecutive loss function values. The proposed WSS is an extension of [21], where $j$ is chosen as the minimizer of an upper bound on the difference between consecutive loss function values. By instead selecting $j$ through the second-order Taylor polynomial approximation, SO-like information can be used to solve the more complex nonsmooth nonlinear optimization problem.
Algorithm 1 presents the pseudocode for the three fundamental parts of the SMO algorithm. To avoid numerical problems, especially for values close to the constraints, a small perturbation value $\delta$ is added to the relevant sections in Algorithm 1 and chosen as $\delta=\hat{C}\times 10^{-5}$. The design of the algorithm also prioritizes computational efficiency. For example, consider the quadratic part of the loss function defined in (22), $f_{q}(\boldsymbol{\alpha})=\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}+\mathbf{p}^{T}\boldsymbol{\alpha}$. When updating the gradient $\nabla f_{q}(\boldsymbol{\alpha})=\mathbf{K}\boldsymbol{\alpha}+\mathbf{p}$, we avoid calculating the entire kernel matrix $\mathbf{K}$. Instead, an iterative approach is employed, where the gradient is updated at each step as $\nabla f_{q}(\boldsymbol{\alpha}):=\nabla f_{q}(\boldsymbol{\alpha})-\mathbf{K}_{i}d_{j}^{*}+\mathbf{K}_{j}d_{j}^{*}$, with $\mathbf{K}_{i}$ and $\mathbf{K}_{j}$ denoting the $i$-th and $j$-th columns of $\mathbf{K}$. This reduces the computational cost significantly. Furthermore, the terms $T'(\alpha_{i})$ and $T'(\alpha_{j})$ are also updated iteratively, as shown in Algorithm 1, to avoid additional redundant nonlinear computations, where $T(\alpha_{s})=-\frac{\hat{C}}{\eta_{2}}\ln\cosh\big(\tanh^{-1}\big(\frac{\alpha_{s}}{\hat{C}}\big)\big)+\frac{\alpha_{s}}{\eta_{2}}\tanh^{-1}\big(\frac{\alpha_{s}}{\hat{C}}\big)$ and hence $T'(\alpha_{s})=\frac{1}{\eta_{2}}\tanh^{-1}\big(\frac{\alpha_{s}}{\hat{C}}\big)$.
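The per-iteration bookkeeping described above can be written as a short O(L) routine; the fragment below is a schematic of ours (variable names are illustrative), not the actual implementation used in the experiments.

```cpp
#include <cmath>
#include <vector>

// One SMO step's bookkeeping after computing d = d_j^*: apply (26)-(27) and refresh the cached
// gradient of the quadratic part and the cached T' values in O(L) time, using only the two
// kernel columns K_i and K_j (no full matrix-vector product with K is ever formed).
void apply_step_and_update_caches(std::vector<double>& alpha,
                                  std::vector<double>& grad_fq,           // grad f_q(alpha) = K*alpha + p
                                  std::vector<double>& Tp,                // Tp[s] = T'(alpha_s) = atanh(alpha_s/Chat)/eta2
                                  const std::vector<double>& K_col_i,
                                  const std::vector<double>& K_col_j,
                                  int i, int j, double d, double Chat, double eta2) {
    alpha[i] -= d;                                            // (27)
    alpha[j] += d;                                            // (26)
    for (std::size_t s = 0; s < grad_fq.size(); ++s)
        grad_fq[s] += (K_col_j[s] - K_col_i[s]) * d;          // grad_fq := grad_fq - K_i*d + K_j*d
    Tp[i] = std::atanh(alpha[i] / Chat) / eta2;
    Tp[j] = std::atanh(alpha[j] / Chat) / eta2;
    // The full gradient used by the WSS and the stopping test is grad_f[s] = grad_fq[s] + Tp[s].
}

int main() { return 0; }   // helper only; called once per SMO iteration
```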
The convergence results from [21] for ε-l2 SVR can be extended to ε-ln SVR in a different manner that relies on the strong convexity of the ε-ln SVR problem. It should be noted that, when $\eta_{1}=\eta_{2}^{2}\to 0$, the ε-ln SVR becomes equivalent to the ε-l2 SVR, and Lemma 1 also becomes equivalent to Lemma 2 in [21], derived for the nonsmooth, indeed piecewise quadratic, dual problem of ε-l2 SVR. In this context, only the following lemma, which determines an upper bound on the decrease in consecutive loss function values in each iteration of the SMO algorithm, is presented.
Lemma 1.
The decrease in the dual function (22) in an iteration of SMO satisfies
$$J(\boldsymbol{\alpha}^{k+1})-J(\boldsymbol{\alpha}^{k})\le-\frac{\|\boldsymbol{\alpha}^{k+1}-\boldsymbol{\alpha}^{k}\|^{2}}{2\eta_{2}\hat{C}} \qquad (42)$$
Proof. 
$\hat{f}(d_{j})$ is strongly convex, because $\hat{f}(d_{j})-\frac{1}{\eta_{2}\hat{C}}d_{j}^{2}$ is convex: its second derivative, $a_{ij}+\frac{1}{\eta_{2}\hat{C}}\frac{1}{1-\frac{(\alpha_{j}^{k}+d_{j})^{2}}{\hat{C}^{2}}}+\frac{1}{\eta_{2}\hat{C}}\frac{1}{1-\frac{(\alpha_{i}^{k}-d_{j})^{2}}{\hat{C}^{2}}}-\frac{2}{\eta_{2}\hat{C}}\ge a_{ij}$, is greater than zero. Herein, $a_{ij}=K_{ii}+K_{jj}-2K_{ij}>0$ since the kernel matrix $\mathbf{K}$ satisfying Mercer's condition is a positive definite matrix. Therefore, $\hat{J}(d_{j})$ is also strongly convex, since $\hat{J}(d_{j})=\hat{f}(d_{j})+\varepsilon|\alpha_{i}^{k}-d_{j}|+\varepsilon|\alpha_{j}^{k}+d_{j}|$. Then, by the definition of strong convexity [51], it can be written that
$$\hat{J}(d_{j}^{*})-\hat{J}(0)\le z_{d_{j}^{*}}d_{j}^{*}-\frac{1}{\eta_{2}\hat{C}}(d_{j}^{*})^{2}=-\frac{1}{\eta_{2}\hat{C}}(d_{j}^{*})^{2} \qquad (43)$$
where $z_{d_{j}^{*}}\in\partial\hat{J}(d_{j}^{*})$ is a subgradient of $\hat{J}$ at $d_{j}^{*}$ and $z_{d_{j}^{*}}=0$ since $d_{j}^{*}$ is the optimal point. The proof concludes with $J(\boldsymbol{\alpha}^{k+1})-J(\boldsymbol{\alpha}^{k})=\hat{J}(d_{j}^{*})-\hat{J}(0)\le-\frac{1}{\eta_{2}\hat{C}}(d_{j}^{*})^{2}=-\frac{\|\boldsymbol{\alpha}^{k+1}-\boldsymbol{\alpha}^{k}\|^{2}}{2\eta_{2}\hat{C}}$.    □
Algorithm 1: SMO-like algorithm for the nonsmooth nonlinear dual problem
input: Training data $\{(\mathbf{x}_{1},y_{1}),\dots,(\mathbf{x}_{L},y_{L})\}$
output: $\boldsymbol{\alpha}$, $b$
Initialize by setting $\boldsymbol{\alpha}=\mathbf{0}$, $\nabla f_{q}(\boldsymbol{\alpha})=\mathbf{p}$, $T'(\alpha_{s})=0$, $g_{s}(\boldsymbol{\alpha}):=-\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon\,\mathrm{sign}^{+}(\alpha_{s})$ and $G_{s}(\boldsymbol{\alpha}):=-\nabla f(\boldsymbol{\alpha})_{s}-\varepsilon\,\mathrm{sign}^{-}(\alpha_{s})$ for $s\in\{1,\dots,L\}$; repeat
[Loop body rendered as an image in the original article: select the working set $\{i,j\}$ by (36) and (37); compute $d_{j}^{*}$ from (25), using Brent's method for the root-finding case; update $\alpha_{i}$ and $\alpha_{j}$ by (26) and (27); update the cached $\nabla f_{q}(\boldsymbol{\alpha})$, $T'(\alpha_{i})$, $T'(\alpha_{j})$, $g$ and $G$.]
until the stopping criterion (35) is satisfied, i.e., $m(\boldsymbol{\alpha})-M(\boldsymbol{\alpha})\le\tau$;
Calculate $b:=\frac{1}{2}\Big[\max_{s\in\{s\mid\alpha_{s}<\hat{C}-\delta\}}g_{s}(\boldsymbol{\alpha})+\min_{s\in\{s\mid\alpha_{s}>-\hat{C}+\delta\}}G_{s}(\boldsymbol{\alpha})\Big]$

4. Experiments

This section considers the dual problem (14) of the proposed ε-ln SVR, where the parameters $\eta_{1}$ and $\eta_{2}$ are tied by $\eta_{2}=\sqrt{\tfrac{1}{2}\psi_{1}\big(\tfrac{1}{2\eta_{1}}\big)}$. Tuning $\eta_{1}$ then optimizes the loss function over the family of PHS distributions, as detailed in Section 2.2. To solve this challenging nonsmooth nonlinear dual problem (14), which includes a linear equality constraint and box constraints, a novel SMO-like algorithm with an efficient WSS procedure is introduced. This approach involves minimizing an upper bound on the second-order approximation derived from the Taylor series expansion of the difference between consecutive loss function values.
For evaluating the ε-ln SVR, comparisons are drawn against the ε-l2 SVR and the ε-SVR, solved by the SMO algorithms of [21] and [52], respectively. The proposed WSS procedure with SO-like information, which is the part of the SMO-like algorithm that most affects the convergence time, is compared with its FO counterpart. Additionally, the proposed SMO algorithm for the nonsmooth dual problem is compared with the SMO algorithm for its smooth counterpart (7). All implementations are written in C++. The SMO-like algorithm presented in Algorithm 1 is integrated into the well-known LIBSVM library [52] to take advantage of its efficient kernel computation and caching mechanisms.
All experiments were carried out on a PC equipped with an Intel Core i5-12450H processor and 16 GB of RAM, operating on a 64-bit Windows 11 system. The RBF kernel, defined as $K(\mathbf{x}_{s},\mathbf{x}_{r})=\exp\big(-\|\mathbf{x}_{s}-\mathbf{x}_{r}\|^{2}/(2\sigma^{2})\big)$, was used. The stopping criterion (35) with $\tau=10^{-3}$ and a cache size of 100 MB were set across all SMO algorithm implementations. Five-fold cross-validation was used to tune the regularization parameter, the kernel parameter and $\eta_{1}$ within the ranges $\{10^{-1},10^{0},10^{1},10^{2},10^{3}\}$, $\{2^{-3},2^{-2},2^{-1},2^{0},2^{1}\}$ and $\{2^{-3},2^{-2},\dots,2^{3}\}$, respectively. The epsilon parameter was then chosen from a separate set $\{0,0.02,0.05,0.1,0.2,0.5,1,1.5,2,2.5,4\}$. The choice of the $\eta_{1}$ parameter range is based on the understanding of its influence on the lncosh loss function. As shown in Figure 1, smaller values of $\eta_{1}$ result in behavior similar to the l2 loss, while larger values resemble the l1 loss. This is consistent with the theoretical basis given in Equations (20) and (21), where tuning $\eta_{1}$ covers the spectrum between Vapnik's loss and the ε-insensitive l2 loss at its extreme values. In addition, Figure 1e shows the relationship between $\eta_{1}$ and the associated probability density function. In particular, $\eta_{1}=2^{3}$ and $\eta_{1}=2^{-3}$ lead to approximations of the Laplacian and Gaussian distributions, respectively. Based on these observations, we chose a range for $\eta_{1}$ that effectively captures this transition in the behavior of the loss function and the probability density. Root Mean Square Error (RMSE) was used for performance evaluation. Results are averages over four cross-validation repetitions, obtained by running the three SVR variants on nine benchmark datasets. Among these datasets, Mpg, Housing, Space ga, Abalone and CpuSmall are accessible on the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 20 April 2024), while the remaining datasets were taken from the UCI machine-learning repository (https://archive.ics.uci.edu/datasets, accessed on 20 April 2024). While the outputs remained unchanged, the inputs were normalized to the closed interval $[0,1]$ due to varying dataset ranges. The outputs were then contaminated with additive Gaussian $N(0,0.04)$ and Cauchy $C(0,0.1)$ noises to compare the noise and outlier robustness of the proposed and traditional SVR variants. The iteration and training time ratios reported in Figure 2 and Figure 3 are defined as follows:
$$\text{iteration ratio}=\frac{\#\ \text{iterations by the proposed method}}{\#\ \text{iterations by the other method}}$$
$$\text{training time ratio}=\frac{\text{training time by the proposed method}}{\text{training time by the other method}}$$
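As a small illustration of the experimental setup (our own sketch, not the evaluation code used in the paper), the RBF kernel above and the min-max scaling of the inputs to $[0,1]$ can be written as follows.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// RBF kernel used in the experiments: K(x_s, x_r) = exp(-||x_s - x_r||^2 / (2*sigma^2)).
double rbf_kernel(const std::vector<double>& xs, const std::vector<double>& xr, double sigma) {
    double d2 = 0.0;
    for (std::size_t m = 0; m < xs.size(); ++m) d2 += (xs[m] - xr[m]) * (xs[m] - xr[m]);
    return std::exp(-d2 / (2.0 * sigma * sigma));
}

// Column-wise min-max scaling of the inputs to [0, 1], applied before training.
void normalize_01(std::vector<std::vector<double> >& X) {
    if (X.empty()) return;
    for (std::size_t m = 0; m < X[0].size(); ++m) {
        double lo = X[0][m], hi = X[0][m];
        for (std::size_t s = 0; s < X.size(); ++s) { lo = std::min(lo, X[s][m]); hi = std::max(hi, X[s][m]); }
        double range = (hi > lo) ? (hi - lo) : 1.0;   // constant columns are left at 0
        for (std::size_t s = 0; s < X.size(); ++s) X[s][m] = (X[s][m] - lo) / range;
    }
}

int main() { return 0; }
```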
Table 1 shows a comparison of ε - l 2 SVR, ε -SVR and the proposed ε -ln SVR variants. “# of SVs” is the number of support vectors identified by the model during training. “Test RMSE” represents the RMSE on the unseen test data, which evaluates the generalization performance of the model. “Training cpu time” indicates the computational time required to train the model on the training data. “# of iterations” indicates the number of iterations required for the SMO-like optimization algorithm described in Algorithm 1 to converge. “C” is the regularization parameter, which controls the trade-off between fitting the data and avoiding overfitting. “ σ ” is the kernel parameter, which controls the influence of the data points in the kernel function. “ ε ” is the epsilon parameter, which defines the tolerance for errors within the ε -insensitive zone of the SVR model. “ η 1 ” is a hyperparameter specific to the proposed ε -ln SVR loss function, controlling the shape of the lncosh loss function as shown in Figure 1. The values of the hyperparameters (C, σ , ε , η 1 ) differ between the SVR variants because the employed losses have different characteristics. The l 2 loss function lacks a bounded influence function and is weak in terms of outlier robustness. Therefore, it is observed that the value of the regularization parameter C for ε - l 2 SVR is the lowest for most datasets to avoid overfitting, given that the datasets are contaminated with Cauchy and Gaussian noise. Additionally, it is noted that ε - l 2 SVR tends to have the largest σ for most datasets to mitigate the risk of overfitting, as a lower σ results in a more localized influence, leading to complex fitting functions, especially in the presence of outliers and noise. To emphasize the influence of the tunable parameter η 1 within the lncosh loss function, datasets are selected from commonly used regression benchmarks, ensuring diversity in size and characteristics. The datasets are arranged in increasing order of training samples, ranging from small (e.g., Servo) to large (e.g., CpuSmall). Notably, datasets are chosen that exhibit a range of optimal η 1 values, as shown in Table 1. This selection allows us to demonstrate the effectiveness of the lncosh loss function and its tunability across diverse datasets.
While ε -SVR and ε - l 2 SVR employ loss functions optimized for specific noise distributions such as Laplace and Gaussian, respectively, the ε -ln SVR offers greater flexibility. Its modified lncosh loss is optimal for the broader family of power-raised hyperbolic secant distributions, encompassing Laplace, Gaussian and hyperbolic secants as special cases. Moreover, as indicated in (20) and (21), the lncosh loss approaches Vapnik’s and ε -insensitive l 2 losses for the limit values of η 1 . For other values of η 1 , it continues to exhibit its inherent properties, including robustness to noise and outliers. So, it effectively addresses outliers induced by Cauchy noise while also handling small noises mostly exhibiting Gaussian distribution. Therefore, the ε -ln SVR has better test RMSE compared to ε -SVR and ε - l 2 SVR, as shown in Table 1, thanks to its ability to be optimal for different noise distributions by tuning η 1 . While these results are an extension of the previous work [40], a computationally efficient SMO-like algorithm is provided to solve the nonsmooth nonlinear dual problem of ε -ln SVR in order to handle larger datasets. Furthermore, it has been observed that solving such a complex nonsmooth nonlinear optimization problem (14) requires comparable training times as seen in Table 1 compared to solving the piecewise QP problems of ε - l 2 SVR [21] and QP problems of ε -SVR [52]. ε -SVR and ε - l 2 SVR have a well-defined analytical solution at each iteration for solving the subproblem consisting of two optimization parameters. Despite the lack of a direct analytical solution, the proposed SMO algorithm uses Brent’s method at each iteration to address this complex problem and provides acceptable training times through its easily computable WSS procedure. Due to its fast nature and single execution per iteration as demonstrated in Algorithm 1, Brent’s method contributes minimally to the overall running time of the algorithm.
The SMO algorithm solves the optimization problem iteratively by updating only two Lagrangian multipliers at each step. The choice of the right pair, known as the working set selection procedure, is crucial as it significantly affects the convergence speed. Therefore, the proposed WSS procedure with SO-like information, which is designed to minimize an upper bound on the second-order Taylor polynomial approximation of consecutive loss function values, is compared with the traditional WSS procedure with FO information. It is observed that the proposed SMO with easily computable WSS procedure significantly improves both the number of iterations and the training times compared to its FO counterpart, as shown in Figure 2. This is evident in all datasets, where both time and iteration ratios are consistently less than 1. The strength of the WSS procedure lies in its efficient retrieval of SO-like information. This allows it to prioritize pairs that maximize the decrease in consecutive loss function values. It achieves this by estimating an upper bound on the Taylor polynomial approximation, unlike the FO counterpart which simply uses gradient information. This often results in many more iterations for the FO counterpart because it lacks insight into the potential reduction in consecutive losses.
The proposed SMO algorithm is specifically designed to optimize dual convex nonsmooth problem (14) but it can also handle the smooth counterpart (7) as a special case. To demonstrate the effectiveness of the SMO algorithm for solving this nonsmooth formulation, we compare its performance to the smooth counterpart. As shown in Figure 3, SMO algorithm for solving this nonsmooth version consistently performs better, with iteration ratios close to 1 and training time ratios significantly less than 1 for all datasets. This stems from the smooth version’s SMO algorithm handling twice the number of optimization variables, requiring caching a larger kernel matrix of size K R 2 L × 2 L rather than K R L × L . The findings from comparing the proposed method with its FO and smooth counterparts are consistent with previous studies on SMO algorithms for solving piecewise QP problems in ε - l 2 SVR [21] and ε -SVR [22], but with the key distinction that the idea of minimizing an upper bound of the second-order Taylor polynomial approximation in WSS allows for a computationally efficient SMO algorithm even for this complex nonsmooth problem. Figure 2b and Figure 3b demonstrate the efficiency further, presenting ratios for total iterations and times during hyperparameter selection, again highlighting the method’s advantages across the different values of the hyperparameters.
The proposed SMO-like algorithm generalizes the classical SMO algorithm, designed for solving QP problems, to tackle more general nonsmooth, nonlinear convex dual problems arising in SVR with various loss functions. This generalization allows it to be adapted to both ε - l 2 SVR and ε - l 1 SVR. Due to the existence of analytical solutions for each iteration in the SMO-like algorithms for ε - l 2 SVR and ε - l 1 SVR, Brent’s method becomes unnecessary. Consequently, only the WSS part, a key innovation of our algorithm, is embedded into the SMO-like algorithms presented in [21,22]. As shown in Figure 4a,b, the iteration ratios for both ε - l 2 SVR and ε - l 1 SVR cases are exactly 1. This indicates that the proposed WSS, which leverages an upper bound based on the second-order Taylor polynomial approximation of consecutive loss function values, performs the same number of total iterations during hyperparameter selection for all datasets. Figure 4 presents the ratios for total times during hyperparameter selection. The proposed, easy-to-compute WSS described in Algorithm 1 demonstrates clear efficiency gains compared to those in [21,22], since time ratios are below 1 for all datasets. The efficiency comes from incorporating the concept of working set selection, which is associated with the second-order Taylor polynomial approximation, and defining an upper bound that is easy to compute, as described in Proposition 4. This enables the efficient use of SO-like information to effectively tackle SVR problems, even those involving complex nonsmooth nonlinear convex scenarios.

5. Discussion

In this study, we introduce the ε -ln SVR model with a flexible lncosh loss function and demonstrate significant advances in its optimization and applicability. We derive a computationally efficient nonsmooth dual formulation of the problem, which addresses the challenges of nonlinearity and nondifferentiability. To overcome these complexities, a novel SMO-like algorithm with an effective WSS procedure is developed. This WSS procedure exploits second-order information by minimizing an upper bound on the Taylor polynomial approximation of consecutive loss function values, resulting in improved computational efficiency compared to its first-order and smooth counterparts. In addition, the single-parameter adjustable lncosh loss function is shown to be optimal in the maximum likelihood sense for the PHS distribution, which includes Laplace, Gaussian and hyperbolic secant distributions. This adjustable single-parameter design is shown to be advantageous for adaptation to unknown noise distributions in practical applications. Overall, the ability of the proposed SMO-like algorithm to handle a class of nonlinear convex problems demonstrates its potential applicability to SVR models with different loss functions optimized for different noise distributions and other related problems such as Lasso and Extreme Learning Machine. We expect that this innovative combination of ε -ln SVR and an adapted SMO-like algorithm will pave the way for more robust and efficient SVR implementations with diverse loss functions that are optimal in the maximum likelihood sense for different noise distributions.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  2. Vapnik, V.N. Statistical Learning Theory; John Wiley & Sons: New York, NY, USA, 1998. [Google Scholar]
  3. Boser, B.; Guyon, I.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992. [Google Scholar]
  4. Cortes, C.; Vapnik, V.N. Support-vector network. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  5. Arnosti, N.A.; Kalita, J.K. Cutting Plane Training for Linear Support Vector Machines. IEEE Trans. Knowl. Data Eng. 2013, 25, 1186–1190. [Google Scholar] [CrossRef]
  6. Chu, D.; Zhang, C.; Tao, Q. A Faster Cutting Plane Algorithm with Accelerated Line Search for Linear SVM. Pattern Recognit. 2017, 67, 127–138. [Google Scholar] [CrossRef]
  7. Xu, Y.; Akrotirianakis, I.; Chakraborty, A. Proximal gradient method for huberized support vector machine. Pattern Anal. Appl. 2016, 19, 989–1005. [Google Scholar] [CrossRef]
  8. Ito, N.; Takeda, A.; Toh, K.C. A unified formulation and fast accelerated proximal gradient method for classification. J. Mach. Learn. Res. 2017, 18, 1–49. [Google Scholar]
  9. Majlesinasab, N.; Yousefian, F.; Pourhabib, A. Self-Tuned Mirror Descent Schemes for Smooth and Nonsmooth High-Dimensional Stochastic Optimization. IEEE Trans. Autom. Control 2019, 64, 4377–4384. [Google Scholar] [CrossRef]
  10. Balasundaram, S.; Gupta, D.; Kapil. Lagrangian support vector regression via unconstrained convex minimization. Neural Netw. 2014, 51, 67–79. [Google Scholar] [CrossRef]
  11. Balasundaram, S.; Yogendra, M. A new approach for training Lagrangian support vector regression. Knowl. Inf. Syst. 2016, 49, 1097–1129. [Google Scholar] [CrossRef]
  12. Balasundaram, S.; Benipal, G. On a new approach for Lagrangian support vector regression. Neural Comput. Appl. 2018, 29, 533–551. [Google Scholar] [CrossRef]
  13. Wang, H.; Shi, Y.; Niu, L.; Tian, Y. Nonparallel Support Vector Ordinal Regression. IEEE Trans. Cybern. 2017, 47, 3306–3317. [Google Scholar] [CrossRef]
  14. Yin, J.; Li, Q. A semismooth Newton method for support vector classification and regression. Comput. Optim. Appl. 2019, 73, 477–508. [Google Scholar] [CrossRef]
  15. Platt, J.C. Fast training of support vector machines using sequential minimal optimization. In Kernel Methods: Support Vector Machines; Schölkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  16. Keerthi, S.S.; Shevade, S.K.; Bhattacharyya, C.; Murthy, K.R.K. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput. 2001, 13, 637–649. [Google Scholar] [CrossRef]
  17. Fan, R.E.; Chen, P.H.; Lin, C.J. Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 2005, 6, 1889–1918. [Google Scholar]
  18. Flake, G.W.; Lawrence, S. Efficient SVM regression training with SMO. Mach. Learn. 2002, 46, 271–290. [Google Scholar] [CrossRef]
  19. Guo, J.; Takahashi, N.; Nishi, T. A novel sequential minimal optimization algorithm for support vector regression. In Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science; King, I., Wang, J., Chan, L.W., Wang, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 827–836. [Google Scholar]
  20. Takahashi, N.; Guo, J.; Nishi, T. Global convergence of SMO algorithm for support vector regression. IEEE Trans. Neural Netw. 2008, 19, 971–982. [Google Scholar] [CrossRef]
  21. Kocaoğlu, A. An efficient SMO algorithm for Solving non-smooth problem arising in ε-insensitive support vector regression. Neural Process. Lett. 2019, 50, 933–955. [Google Scholar] [CrossRef]
  22. Kocaoğlu, A. A sequential minimal optimization algorithm with second-order like information to solve a non-smooth support vector regression constrained dual problem. Uludağ Univ. J. Fac. Eng. 2021, 26, 1111–1120. [Google Scholar] [CrossRef]
  23. Tang, L.; Tian, Y.; Yang, C.A. Nonparallel support vector regression model and its SMO-type solver. Neural Netw. 2018, 105, 431–446. [Google Scholar] [CrossRef]
  24. Abe, S. Optimizing working sets for training support vector regressors by Newton’s method. In Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015. [Google Scholar]
  25. Keerthi, S.S.; Shevade, S.K. SMO algorithm for least-squares SVM formulations. Neural Comput. 2003, 15, 487–507. [Google Scholar] [CrossRef]
  26. Lopez, J.; Suykens, J.A.K. First and Second Order SMO Algorithms for LS-SVM Classifiers. Neural Process. Lett. 2011, 33, 31–44. [Google Scholar] [CrossRef]
  27. Kumar, R.; Sinha, A.; Chakrabarti, S.; Vyas, O.P. A fast learning algorithm for one-class slab support vector machines. Knowl. Based Syst. 2021, 53, 107267. [Google Scholar] [CrossRef]
  28. Gu, B.; Shan, Y.; Quan, X.; Zheng, G. Accelerating sequential minimal optimization via Stochastic subgradient descent. IEEE Trans. Cybern. 2021, 51, 2215–2223. [Google Scholar] [CrossRef] [PubMed]
  29. Galvan, G.; Lapucci, M.; Lin, C.J. A two-Level decomposition framework exploiting first and second order information for SVM training problems. J. Mach. Learn. Res. 2021, 22, 1–38. [Google Scholar]
  30. Huang, X.; Shi, L.; Suykens, J.A.K. Support vector machine classifier with pinball loss. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 984–997. [Google Scholar] [CrossRef] [PubMed]
  31. Huang, X.; Shi, L.; Suykens, J.A.K. Sequential minimal optimization for SVM with pinball loss. Neurocomputing 2015, 149, 1596–1603. [Google Scholar] [CrossRef]
  32. Huang, X.; Shi, L.; Suykens, J.A.K. Asymmetric least squares support vector machine classifiers. Comput. Stat. Data Anal. 2014, 70, 395–405. [Google Scholar] [CrossRef]
  33. Farooq, F.; Steinwart, I. An SVM-like approach for expectile regression. Comput. Stat. Data Anal. 2017, 109, 159–181. [Google Scholar] [CrossRef]
  34. Balasundaram, S.; Meena, Y. Robust Support Vector Regression in Primal with Asymmetric Huber Loss. Neural Process. Lett. 2019, 49, 1399–1431. [Google Scholar] [CrossRef]
  35. Zhang, S.; Hu, Q.; Xie, Z.; Mi, J. Kernel ridge regression for general noise model with its application. Neurocomputing 2015, 149, 836–846. [Google Scholar] [CrossRef]
  36. Prada, J.; Dorronsoro, J.R. General noise support vector regression with non-constant uncertainty intervals for solar radiation prediction. J. Mod. Power Syst. Clean Energy 2018, 6, 268–280. [Google Scholar] [CrossRef]
37. Wang, Y.; Yang, L.; Yuan, C. A robust outlier control framework for classification designed with family of homotopy loss function. Neural Netw. 2019, 112, 41–53. [Google Scholar] [CrossRef] [PubMed]
  38. Anand, P.; Khemchandani, R.R.; Chandra, S. A class of new support vector regression models. Appl. Soft Comput. 2020, 94, 106446. [Google Scholar] [CrossRef]
  39. Dong, H.; Yang, L. Kernel-based regression via a novel robust loss function and iteratively reweighted least squares. Knowl. Inf. Syst. 2021, 63, 1149–1172. [Google Scholar] [CrossRef]
  40. Karal, O. Maximum likelihood optimal and robust Support Vector Regression with lncosh loss function. Neural Netw. 2017, 94, 1–12. [Google Scholar] [CrossRef]
  41. Kocaoğlu, A.; Karal, Ö.; Güzeliş, C. Analysis of chaotic dynamics of Chua’s circuit with lncosh nonlinearity. In Proceedings of the 8th International Conference on Electrical and Electronics Engineering, Bursa, Turkey, 28–30 November 2013. [Google Scholar]
  42. Liu, C.; Jiang, M. Robust adaptive filter with lncosh cost. Signal Process. 2020, 168, 107348. [Google Scholar] [CrossRef]
  43. Liang, T.; Li, Y.; Zakharov, Y.V.; Xue, W.; Qi, J. Constrained least lncosh adaptive filtering algorithm. Signal Process. 2021, 183, 108044. [Google Scholar] [CrossRef]
  44. Liang, T.; Li, Y.; Xue, W.; Li, Y.; Jiang, T. Performance and analysis of recursive constrained least lncosh algorithm under impulsive noises. IEEE Trans. Circuits Syst. II 2021, 68, 2217–2221. [Google Scholar] [CrossRef]
  45. Guo, K.; Guo, L.; Li, Y.; Zhang, L.; Dai, Z.; Yin, J. Efficient DOA estimation based on variable least Lncosh algorithm under impulsive noise interferences. Digital Signal Process. 2022, 122, 103383. [Google Scholar] [CrossRef]
  46. Yang, Y.; Zhou, H.; Gao, Y.; Wu, J.; Wang, Y.-G.; Fu, L. Robust penalized extreme learning machine regression with applications in wind speed forecasting. Neural Comput. Appl. 2022, 34, 391–407. [Google Scholar] [CrossRef]
  47. Zhao, H.; Wang, Z.; Xu, W. Augmented complex least lncosh algorithm for adaptive frequency estimation. IEEE Trans. Circuits Syst. II 2023, 70, 2685–2689. [Google Scholar] [CrossRef]
  48. Yang, Y.; Zhou, H.; Wu, J.; Ding, Z.; Tian, Y.-C.; Yue, D.; Wang, Y.-G. Robust adaptive rescaled lncosh neural network regression toward time-series forecasting. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 5658–5669. [Google Scholar] [CrossRef]
  49. Faliva, M.; Zoia, M.G. A distribution family bridging the Gaussian and the Laplace laws, Gram–Charlier expansions, Kurtosis behaviour, and entropy features. Entropy 2017, 19, 149. [Google Scholar] [CrossRef]
50. Debruyne, M.; Hubert, M.; Suykens, J.A.K. Model selection in kernel based regression using the influence function. J. Mach. Learn. Res. 2008, 9, 2377–2400. [Google Scholar]
  51. Bubeck, S. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn. 2015, 8, 231–357. [Google Scholar] [CrossRef]
52. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27. [Google Scholar] [CrossRef]
Figure 1. (a) The loss functions (9) for different values of η1 with η2 = (1/2)ψ1((1/2)η1) and ε = 0. (b) The loss functions with ε = 0.3. (c,d) The first derivatives of the corresponding loss functions. (e,f) The associated probability density functions, which approximate the Laplacian and Gaussian distributions for η1 = 2^3 and η2 = 2^3, respectively. L(0, 1/2) and N(0, 1) denote the Laplace and Gaussian distributions.
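For readers who want to reproduce the general shape of these curves, the short Python sketch below evaluates an ε-insensitive lncosh-type loss and its derivative. The specific form used here, η1·ln cosh(η2·max(|u| − ε, 0)), is an illustrative assumption standing in for the paper's loss (9), whose exact parameterization is defined in the body of the text.

```python
import numpy as np

def lncosh_loss(u, eta1=1.0, eta2=1.0, eps=0.0):
    """Illustrative epsilon-insensitive lncosh-type loss (assumed form, not the paper's (9))."""
    r = np.maximum(np.abs(u) - eps, 0.0)            # epsilon-insensitive residual
    x = eta2 * r
    # log(cosh(x)) evaluated stably as |x| + log(1 + exp(-2|x|)) - log(2)
    return eta1 * (np.abs(x) + np.log1p(np.exp(-2.0 * np.abs(x))) - np.log(2.0))

def lncosh_loss_derivative(u, eta1=1.0, eta2=1.0, eps=0.0):
    """Derivative w.r.t. u: zero inside the epsilon-tube, bounded (tanh-shaped) outside."""
    outside = np.abs(u) > eps
    r = np.sign(u) * np.maximum(np.abs(u) - eps, 0.0)
    return np.where(outside, eta1 * eta2 * np.tanh(eta2 * r), 0.0)

u = np.linspace(-3.0, 3.0, 7)
print(lncosh_loss(u, eta1=2.0, eta2=0.5, eps=0.3))
print(lncosh_loss_derivative(u, eta1=2.0, eta2=0.5, eps=0.3))
```

With this assumed form, the loss grows quadratically (Gaussian-like) just outside the tube and linearly (Laplace-like) for large residuals, while the derivative saturates through the tanh term; this bounded-influence behavior is what panels (c,d) of Figure 1 illustrate.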
Figure 2. Comparison of the proposed SMO algorithm with the novel WSS procedure, which provides SO-like information, and the SMO algorithm with the traditional WSS procedure, which provides first-order (FO) information. Iteration and training time ratios are presented for (a) the optimal hyperparameters specified in Table 1 and (b) the hyperparameter selection procedure.
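For orientation on what "FO" and "SO-like" selection mean, the sketch below implements the two classical working set selection rules from [17] for the standard C-SVC dual. It is not the paper's WSS procedure, whose gain measure is an upper bound on a second-order Taylor approximation of consecutive objective values for the ε-ln SVR dual, but it illustrates the FO/SO distinction the figure refers to.

```python
import numpy as np

def select_pair(alpha, grad, y, Q, C, tau=1e-12, second_order=True):
    """Classical SMO pair selection for the C-SVC dual (in the spirit of [17]).

    grad is the gradient of the dual objective at alpha and Q[i, j] = y_i * y_j * K(x_i, x_j).
    The caller is assumed to have already checked the stopping condition, so a
    violating pair exists when this function is called.
    """
    up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))
    low = ((y == 1) & (alpha > 0)) | ((y == -1) & (alpha < C))

    # First index: maximal violation in the "up" set (gradient, i.e., first-order, information).
    m = np.where(up, -y * grad, -np.inf)
    i = int(np.argmax(m))

    if not second_order:
        # FO rule: maximal violating pair.
        M = np.where(low, -y * grad, np.inf)
        return i, int(np.argmin(M))

    # SO rule: among violating candidates, maximize the estimated objective decrease b^2 / a.
    b = m[i] + np.where(low, y * grad, -np.inf)      # b_it = -y_i * grad_i + y_t * grad_t
    a = Q[i, i] + np.diag(Q) - 2.0 * Q[i, :]         # curvature of the pairwise subproblem
    a = np.where(a > 0, a, tau)                      # guard against non-positive curvature
    gain = np.where(low & (b > 0), b * b / a, -np.inf)
    return i, int(np.argmax(gain))
```

The FO rule pairs the two most violating indices using gradient information alone, whereas the SO rule additionally uses the curvature a_it = Q_ii + Q_tt − 2Q_it to estimate the actual objective decrease of each candidate pair, which is the kind of extra information the proposed WSS procedure exploits.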
Figure 3. Comparison of the proposed SMO algorithm when solving the smooth problem (7) vs. the nonsmooth problem (14). Iteration and training time ratios are presented for (a) the optimal hyperparameters specified in Table 1 and (b) the hyperparameter selection procedure.
Figure 4. Comparison of the proposed SMO-like algorithm: (a) adapted to solve ε-l2 SVR vs. the SMO-like algorithm introduced in [21], for the hyperparameter selection procedure; (b) adapted to solve ε-l1 SVR vs. the SMO-like algorithm introduced in [22], for the hyperparameter selection procedure.
Table 1. A comparison of ε-l2 SVR, ε-SVR, and ε-ln SVR.
Dataset (# samples × # features) | Method | (C, σ, ε, η1) | # of SVs | Test RMSE | Training CPU Time | # of Iterations
Servo (167 × 4) | ε-l2 SVR | (10^1, 2^0, 0) | 133.5 ± 0.53 | 0.886 ± 0.18 | 0.033 ± 0.04 | 459 ± 13.6
 | ε-SVR | (10^3, 2^0, 0.02) | 130.9 ± 1.79 | 0.785 ± 0.27 | 0.138 ± 0.03 | 57,100.2 ± 16,430.5
 | ε-ln SVR | (10^1, 2^1, 0, 2^1) | 133.6 ± 0.52 | 0.726 ± 0.28 | 0.052 ± 0.04 | 1253 ± 53.5
Auto-mpg (392 × 7) | ε-l2 SVR | (10^1, 2^1, 0.5) | 257.9 ± 7.23 | 2.776 ± 0.33 | 0.064 ± 0.03 | 1049.1 ± 37.7
 | ε-SVR | (10^2, 2^1, 0.5) | 245.5 ± 4.06 | 2.674 ± 0.36 | 0.022 ± 0.03 | 5236.1 ± 938.0
 | ε-ln SVR | (10^1, 2^1, 1.5, 2^1) | 160.2 ± 6.11 | 2.595 ± 0.27 | 0.039 ± 0.02 | 967.9 ± 88.9
Boston (560 × 13) | ε-l2 SVR | (10^2, 2^0, 0) | 404.8 ± 0.42 | 4.626 ± 1.72 | 0.059 ± 0.03 | 4827.3 ± 106.9
 | ε-SVR | (10^2, 2^0, 1) | 265.8 ± 8.32 | 3.461 ± 0.78 | 0.058 ± 0.05 | 4062.1 ± 820.9
 | ε-ln SVR | (10^2, 2^1, 1, 2^1) | 269.3 ± 5.12 | 3.151 ± 0.38 | 0.061 ± 0.03 | 5261.9 ± 252.4
Cooling (768 × 8) | ε-l2 SVR | (10^1, 2^0, 0) | 614.4 ± 0.52 | 3.031 ± 0.33 | 0.044 ± 0.03 | 2339.4 ± 59.1
 | ε-SVR | (10^2, 2^1, 0) | 614.4 ± 0.52 | 1.981 ± 0.20 | 0.222 ± 0.06 | 39,150.1 ± 4705.9
 | ε-ln SVR | (10^3, 2^0, 0, 2^3) | 614.4 ± 0.52 | 1.772 ± 0.15 | 0.153 ± 0.08 | 65,277.4 ± 1747.3
Heating (768 × 8) | ε-l2 SVR | (10^2, 2^0, 0) | 614.3 ± 0.67 | 1.980 ± 0.21 | 0.098 ± 0.02 | 9212.6 ± 213.4
 | ε-SVR | (10^3, 2^0, 0.5) | 421.9 ± 8.28 | 1.124 ± 0.10 | 0.621 ± 0.19 | 143,945.9 ± 27,282.0
 | ε-ln SVR | (10^3, 2^0, 0, 2^0) | 614.4 ± 0.52 | 0.939 ± 0.06 | 0.267 ± 0.07 | 91,786.0 ± 2373.9
Airfoil (1503 × 5) | ε-l2 SVR | (10^1, 2^2, 0) | 1202.4 ± 0.52 | 3.849 ± 1.32 | 0.083 ± 0.05 | 4840.6 ± 130.3
 | ε-SVR | (10^2, 2^3, 0.2) | 1103.5 ± 5.97 | 2.776 ± 0.26 | 0.147 ± 0.08 | 28,317.0 ± 3876.7
 | ε-ln SVR | (10^2, 2^2, 1, 2^1) | 789.7 ± 9.52 | 2.778 ± 0.27 | 0.120 ± 0.03 | 15,367.1 ± 451.2
Space ga (3107 × 6) | ε-l2 SVR | (10^2, 2^1, 0) | 2485.4 ± 0.70 | 1.232 ± 0.40 | 0.243 ± 0.03 | 8361.1 ± 655.1
 | ε-SVR | (10^1, 2^2, 0.05) | 2184.9 ± 11.01 | 0.134 ± 0.01 | 0.508 ± 0.11 | 21,577.9 ± 2162.1
 | ε-ln SVR | (10^2, 2^1, 0, 2^0) | 2485.1 ± 0.74 | 0.116 ± 0.01 | 0.319 ± 0.06 | 20,649.9 ± 648.4
Abalone (4177 × 8) | ε-l2 SVR | (10^0, 2^2, 0) | 3341.2 ± 0.63 | 2.406 ± 0.22 | 0.308 ± 0.07 | 6336.3 ± 119.7
 | ε-SVR | (10^0, 2^0, 1.5) | 1367.8 ± 20.42 | 2.288 ± 0.12 | 0.156 ± 0.02 | 916.5 ± 29.2
 | ε-ln SVR | (10^1, 2^1, 1.5, 2^2) | 1351.5 ± 19.13 | 2.208 ± 0.10 | 0.216 ± 0.06 | 8339.0 ± 594.8
Cpusmall (8192 × 12) | ε-l2 SVR | (10^1, 2^0, 0) | 6553.3 ± 0.48 | 3.478 ± 0.17 | 3.411 ± 0.28 | 14,239.7 ± 127.7
 | ε-SVR | (10^1, 2^1, 1) | 4542.1 ± 28.83 | 3.257 ± 0.06 | 0.713 ± 0.28 | 5976.9 ± 445.6
 | ε-ln SVR | (10^1, 2^1, 0.5, 2^1) | 5497.7 ± 16.87 | 3.126 ± 0.05 | 2.119 ± 0.53 | 48,499.2 ± 1151.9
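Table 1 reports means and standard deviations over repeated random train/test splits. As a minimal sketch of how the ε-SVR columns (# of SVs, test RMSE) could be reproduced with LIBSVM's implementation in scikit-learn [52]: the split ratio, number of repetitions, and the mapping gamma = 1/(2σ²) are assumptions rather than the paper's exact protocol, and the ε-l2 SVR and ε-ln SVR rows, as well as the iteration counts, require the SMO-like solvers described in the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_eps_svr(X, y, C, sigma, eps, n_repeats=10, test_size=0.2, seed=0):
    """Mean +/- std of test RMSE and support-vector count over random splits.

    Hypothetical protocol: split ratio, repetition count, and gamma = 1/(2*sigma**2)
    are illustrative assumptions, not the paper's exact experimental settings.
    """
    rmse, n_sv = [], []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + r)
        model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2), epsilon=eps)
        model.fit(X_tr, y_tr)
        rmse.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
        n_sv.append(len(model.support_))
    return (np.mean(rmse), np.std(rmse)), (np.mean(n_sv), np.std(n_sv))
```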
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
