A Sparse L∞-Norm Regularized Least Squares Support Vector Regression

Liu, Xiaoyong; Li, Dong; Zeng, Chengbin

doi:10.3390/a19020160

Open AccessArticle

A Sparse $L_{\infty}$ -Norm Regularized Least Squares Support Vector Regression

by

Xiaoyong Liu

^*

,

Dong Li

and

Chengbin Zeng

School of Automation Engineering, Moutai Institute, Renhuai 564507, China

^*

Author to whom correspondence should be addressed.

Algorithms 2026, 19(2), 160; https://doi.org/10.3390/a19020160

Submission received: 25 January 2026 / Revised: 15 February 2026 / Accepted: 16 February 2026 / Published: 18 February 2026

(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

Download

Browse Figures

Versions Notes

Abstract

Although Least Squares Support Vector Regression (LSSVR) reduces the hyperparameter space to two, it sacrifices sparsity, causing all training samples to become support vectors and increasing storage costs. In contrast, standard Support Vector Regression (SVR) preserves sparsity but requires tuning three highly coupled hyperparameters, leading to higher computational burden. To address these limitations, this paper proposes a sparse

L_{\infty}

-norm regularized least squares SVR framework that incorporates the infinity norm of approximation errors into both the objective function and inequality constraints. The resulting optimization problem minimizes model complexity while controlling the maximum prediction deviation through a single slack variable, thereby transforming the conventional three-hyperparameter SVR tuning task into a two-parameter problem involving only the regularization coefficient and kernel width. This formulation restores sparsity by enabling a compact support vector set, while preserving the stability and convexity advantages of LSSVR. Experiments on both static and dynamic datasets demonstrate that the proposed method consistently achieves higher predictive accuracy and improved robustness compared with standard SVR and LSSVR. These results indicate that the proposed

L_{\infty}

-norm regularized framework offers a mathematically principled and computationally efficient alternative for sparse, robust, and scalable regression modeling.

Keywords:

sparsity; Least Squares Support Vector Regression (LSSVR);

ℒ

_∞-norm regularization; robust regression

1. Introduction

Guaranteeing the sparsity, modeling accuracy, and efficient optimization of a reduced number of hyperparameters in constructing robust regression models is one of the critical unresolved issues in the realm of machine learning. Practitioners in fields ranging from theoretical research to engineering applications are hesitant to implement machine learning solutions, as the process of constructing predictive regression models with a better generalization performance is not transparent. Support Vector Regression (SVR) [1,2] is a well-known type of non-parametric supervised learning model that integrates the control of model structural risk into machine learning. The risk minimization theory allows for a balance between empirical risk and structural risk, and therefore, SVR surmounts a significant shortcoming of other, nonlinear modeling methods, namely over-fitting [3]. Shortly after its introduction, SVR has been extended in numerous applications, ranging from fault detection [4,5,6], pattern recognition [7,8] and predictive control [9] to brain–computer interfaces [10].

SVR is viewed as one of the most effective and commonly applied techniques in predictive modeling, with improvements in its predictive performance and generalization capacity significantly relying on the adjustment of its hyperparameters. Although it reveals the generalization capability in regression by introducing the

ϵ

-insensitive margin, providing new insights into controlling the structural complexity of models, the optimization process implementing the selection of three hyperparameters, which includes regularization factor

γ

, kernel parameter

σ

, and

ϵ

, makes the regression modeling method more complex [11,12,13,14,15,16,17]. As mentioned in literature [18], the training of SVR is time-consuming on large-scale datasets, especially when it comes to hyperparameter optimization. It is currently reported that finding the optimal hyperparameters in machine learning models efficiently without sacrificing prediction accuracy poses major hurdles [19]. Prevailing techniques like Grid-Search and Random-Search with cross-validation for optimizing the hyperparameters require considerable time in large-scale search spaces, and they do not provide a guarantee of excellent predictive results. Parameters can be categorized into two groups: learnable parameters and hyperparameters. Learnable parameters are obtained from training data using optimization techniques such as quadratic programming (QP), linear programming (LP), or gradient descent. Hyperparameters are not obtained from the data; instead, they must be chosen before starting the learning process, the optimization of which involves identifying the best combination of hyperparameters that either reduces prediction error or enhances prediction accuracy. As a result, hyperparameter optimization plays a vital role in developing powerful and efficient machine learning models.

Various methods exist for hyperparameter optimization in machine learning models, including grid search, probabilistic model-based optimization, random search [20,21], gradient descent algorithms, and nature-inspired optimization methods. Grid search methods incorporating performance metric [22,23] or a weighted error function [19] aim to find the optimal combination of hyperparameters through iterative updates. From the standpoint of Bayesian methods, probabilistic model-based optimization [24,25] employs a Bayesian model to assess the performance of various hyperparameter configurations and chooses the most advantageous options for evaluation. Gradient descent-based approaches are a widely used class of conventional optimization techniques for adjusting real-valued hyperparameters by computing their gradients. Nature-inspired optimization is another method that employs evolution-based techniques, among them genetic algorithms [26] or particle swarm optimization [27], to explore the range of potential hyperparameters. Additionally, the literature [28] proposed an effective warm-start technique as a practical tool for parameter selection in linear SVR, and the approach has been available for public use. The aforementioned optimization methods aim to find the optimal selection to guarantee a better generalization ability by exhaustively exploring all possible combinations in the hyperparameter space. For instance, there are three hyperparameters in SVR, each with 100 options within its respective space, resulting in a total of

10^{6}

combinations. However, if the number of hyperparameters is reduced to two, the total size of possible combinations will decrease from one million to ten thousand. Thus, the hyperparameter optimization process for SVR will be greatly simplified and improved in terms of both time and spatial scale.

Significant efforts have been dedicated to exploring solutions for the redundancy during the hyperparameter optimization of SVR. In this regard, LSSVR, as a variant of SVR and first introduced in [29,30], has been successfully explored and emerged as a prominent technique in machine learning, particularly for classification and regression problems. In the LSSVR framework—rather than addressing a quadratic programming problem with inequality constraints as in the conventional SVR—equality constraints and an

L 2

-loss function are employed. This results in an optimization problem whose dual solution is found by solving a system of linear equations. More importantly, compared to SVR, LSSVR requires the optimization of only two hyperparameters, namely the kernel width

σ

and the regularization factor

γ

, significantly improving the optimization process. As a result, LSSVR has seen widespread application and research in recent years, with over 15,000 citations. For instance, it has been successfully applied in short-term traffic flow prediction [31], optimal control [32], interval-censored data [33] and model identification [34]. Unfortunately, since the emergence of LSSVR, researchers have found that the method lacks model sparsity, meaning that all or most of the samples involved in model training contribute to the regression model, which can easily lead to poor generalization performance. Addressing model sparsity has become a current research hotspot. An early sparse approximation method [35,36] was proposed by Johan Suykens, which can perform pruning based on the physical interpretation of the magnitude of the contribution of ordered support values. Follow-up works have sought to address these challenges by introducing an adaptive pruning algorithm [37,38], Nyström approximation method along with prototype vectors (PVs) [39], gradient descent technique [40] and non-iterative sparse scheme with global representative point selection [41].

However, these methods remain fundamentally limited in achieving true sparsity. Most approaches, such as pruning or global representative point selection, are inherently a posteriori approximations or filtering processes that artificially induce sparsity by discarding samples from the dense solution, inevitably leading to information loss and performance trade-offs. Moreover, variants employing re-weighted or

ϵ

-insensitive loss functions, while introducing sparsity-inducing mechanisms, still rely on heuristic thresholds or complex regularization without fundamentally altering the underlying structure of the least squares problem, resulting in suboptimal, incomplete, or heavily parameter-dependent sparse solutions. The so-called model sparsity, as described in SVR, refers to the fact that when the value of the Lagrange multiplier is relatively small (e.g., its absolute value is less than

10^{- 8}

), the corresponding training samples are not support vectors (SVs) and can be ignored. From the perspective of practical applications or current software tools, the core ideas of traditional SVR and LSSVR still dominate the mainstream, including the widely used MATLAB 2025a software. To the best of our knowledge, the following issues remain to be investigated: (1) The optimization of the three hyperparameter combinations in SVR is complex. (2) The introduction of slack variables within SVR results in additional inequality constraints, making the parameter estimation more challenging. (3) The expected sparsity of LSSVR has essentially not been addressed, and despite some proposed improvements, they still come at the cost of model accuracy.

It is worth noting that the use of the

ℓ_{\infty}

-norm, min–max formulations, and Chebyshev-type approximation has been extensively studied in the optimization and robust regression literature. Therefore, the contribution of this work does not aim to introduce a fundamentally new norm-based regression theory. Instead, this study focuses on embedding an

ℓ_{\infty}

-norm constraint into the SVR framework, with particular attention to kernel-based modeling, sparsity, and computational efficiency.

Unlike classical Chebyshev regression or generic

ℓ_{\infty}

-norm minimization approaches, the proposed LI-SVR is derived within the primal–dual structure of SVR and LSSVR. By reformulating the approximation error using a single global variable

λ

, the resulting optimization problem remains convex and requires only two hyperparameters, namely the regularization parameter and kernel width. Moreover, the number of inequality constraints is reduced to

2 n

, which is half of that required by standard

ϵ

-SVR formulations.

More importantly, the proposed formulation naturally restores model sparsity through the Karush–Kuhn–Tucker (KKT) conditions. Only samples whose prediction errors attain the maximum bound actively constrain the solution and become support vectors, while the remaining samples are strictly inactive. This sparsity mechanism differs fundamentally from post hoc pruning strategies or explicit sparsity-inducing regularizers, and leads to a compact kernel regression model without sacrificing convexity or introducing additional tuning parameters.

From this perspective, the proposed LI-SVR should be viewed as a structurally simplified and sparsity-enhanced kernel regression framework inspired by

ℓ_{\infty}

-norm optimization, rather than as a novel norm definition or a replacement of existing robust regression theories.

Inspired by the core ideas of classical SVR and LSSVR, the main contributions of this work are as follows.

By integrating the concept of infinite norm into our research, we investigate the problem of lacking sparsity of LSSVR and optimizing hyperparameters of $ϵ$ in SVR, exploring a more efficient and generalizable regression prediction method that ensures both model sparsity and modeling accuracy;
We propose a novel minimum infinite norm Support Vector Regression (LI-SVR) method, distinct from classical SVR and LSSVR, to explore the optimal balance between the complex hyperparameter optimization process and sparsity, overcoming the limitations of traditional regression modeling methods;
We demonstrate the sparsity of the proposed method and explain why it does not require optimization of the epsilon-insensitive parameter of $ϵ$ in SVR. Furthermore, we find the maximum approximation error, which is adaptively computed using the proposed method to reflect model accuracy, thereby avoiding the complicated optimization of $ϵ$ .

The rest of this article is structured as follows. Section 2 introduces the theory underlying the algorithms, while the proposed LI-SVR is detailed in Section 3. Section 4 presents the experimental analysis of the results. Finally, the conclusions would be gathered in Section 5.

2. A Concise Overview of Classical SVR and LSSVR

2.1. SVR

As our method in this paper is inspired by SVR and LSSVR, a concise summary of their principles is provided. It can be asserted that SVR developed by Vladimir Vapnik with colleagues [1] offers notable and distinct advantages in small sample learning and is among the most extensively researched models, founded on the statistical learning principles of VC theory. Given a training set with N input–output samples,

x_{k} \in R^{d}

and

y_{k} \in R

,

(x_{k}, y_{k}), k = 1, 2, \dots, N

(1)

SVR aims to determine a nonlinear regression function,

f_{S V R} (x) = w^{T} ϕ (x_{k}) + b

(2)

where

f_{S V R} (x)

can accurately simulate all training data. Since linear regression falls under the umbrella of nonlinear modeling techniques, the following discussion will shift to nonlinear SVR, where its solution commences with the incorporation of high-dimensional feature mapping

ϕ (\cdot)

to formulate the subsequent optimization problem [2],

\begin{matrix} \min : J (w, b) = \frac{1}{2} w^{T} w + γ \sum_{k = 1}^{N} (ξ_{k} + ξ_{k}^{*}) \end{matrix}

(3)

\begin{matrix} s . t . \{\begin{matrix} y_{k} - w^{T} ϕ (x_{k}) - b ⩽ ϵ + ξ_{k}, \\ - y_{k} + w^{T} ϕ (x_{k}) + b ⩽ ϵ + ξ_{k}^{*}, \\ ξ_{k}, ξ_{k}^{*} \geq 0, k = 1, 2, \dots, N, \end{matrix} \end{matrix}

(4)

where the constant

γ

regulates the balance between smoothness of

f_{S V R}

and the extent to which deviations exceeding

ϵ

are permissible, and the slack variables

ξ_{k}, ξ_{k}^{*}

are employed as a means to address potentially unattainable constraints within the optimization framework, ensuring feasibility is maintained.

The essence lies in crafting a Lagrange function L, fusing the objective function (3) with its associated constraints (4) through the introduction of a dual variable set. Importantly, this function possesses a saddle point, relative to both primal and dual variables, precisely at the optimal solution, as elaborated below,

\begin{matrix} L = & \frac{1}{2} w^{T} w + γ \sum_{k = 1}^{N} (ξ_{k} + ξ_{k}^{*}) - \sum_{k = 1}^{N} (η_{k} ξ_{k} + η_{k}^{*} ξ_{k}^{*}) \\ + \sum_{k = 1}^{n} α_{k} (y_{k} - w^{T} ϕ (x_{k}) - b - ϵ - ξ_{k}) \\ + \sum_{k = 1}^{n} α_{k}^{*} (- y_{k} + w^{T} ϕ (x_{k}) + b - ϵ - ξ_{k}^{*}) \end{matrix}

(5)

where

α_{k}, α_{k}^{*}, η_{k}, η_{k}^{*}

are Lagrange multipliers, all of which are greater than or equal to zero. By eliminating the primal variables

w, b, ξ_{k}, ξ_{k}^{*}

based on the saddle point conditions, we can compute

w

as,

\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{k = 1}^{n} (α_{k} - α_{k}^{*}) ϕ (x_{k})

(6)

and the corresponding dual optimization problem is obtained,

\begin{matrix} \min : & J (α_{k}, α_{k}^{*}) = - \sum_{k = 1}^{N} (α_{k} - α_{k}^{*}) y (k) + ϵ \sum_{k = 1}^{N} (α_{k} + α_{k}^{*}) \\ + \frac{1}{2} \sum_{j = 1}^{N} \sum_{k = 1}^{N} (α_{j} - α_{j}^{*}) (α_{k} - α_{k}^{*}) ϕ^{T} (x_{j}) ϕ (x_{k})) \\ s . t . \sum_{k = 1}^{N} (α_{k} - α_{k}^{*}) = 0, 0 \leq α_{k}, α_{k}^{*} \leq γ \end{matrix}

(7)

Based on the Karush–Kuhn–Tucker (KKT) conditions [2], it can be inferred that

α_{k} α_{k}^{*} = 0

. Additionally, it has been proven that the Lagrange multiplier is non-zero only when the condition

f (x_{k}) - y_{k} \geq ϵ

is satisfied, meaning that the k-th sample lies precisely on or outside the

ϵ

-tube and is referred to as a support vector (SV). To reiterate, the KKT conditions are met solely when the Lagrange multipliers corresponding to all samples that lie within the

ϵ

-tube (i.e.,

f (x_{k}) - y_{k} \leq ϵ

) are exactly zero, and this is why SVR exhibits sparsity.

2.2. LSSVR

LSSVR is based on a standard SVR but substitutes the

ϵ

-insensitive loss function with a least squares. It transforms the inequality constraints of SVR into an equality one, thereby converting the quadratic programming problem into a linear equation system. Similarly, LSSVR also attempts to find a better regression function

f (x)

to approximate all training data. By introducing a regularization factor

γ

to fine-tune the degree of optimality between model complexity and error

e_{k}

, it yields the following optimization [29],

\begin{matrix} \min : J (w, b, e) = \frac{1}{2} w^{T} w + \frac{γ}{2} \sum_{k = 1}^{n} e_{k}^{2} \end{matrix}

(8)

\begin{matrix} s . t y_{k} = w^{T} ϕ (x_{k}) + b + e_{k} . \end{matrix}

(9)

To solve this optimization, the Lagrangian function is constructed using Lagrange multipliers

α_{k}

,

\begin{matrix} L (w, b, e, α) = \frac{1}{2} w^{T} w + \frac{γ}{2} \sum_{k = 1}^{n} e_{k}^{2} \\ + \sum_{k = 1}^{n} α_{k} (y_{k} - w^{T} ϕ (x_{k}) - b - e_{k}) \end{matrix}

(10)

Taking partial derivatives of the Lagrangian function with respect to

w, b, e, α

respectively, and setting them to zero, it gets,

\begin{matrix} \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{k = 1}^{n} α_{i} ϕ (x_{i}) \end{matrix}

(11)

\begin{matrix} \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{k = 1}^{n} α_{k} = 0 \end{matrix}

(12)

\begin{matrix} \frac{\partial L}{\partial e_{k}} = 0 \Rightarrow α_{k} = γ e_{k} \end{matrix}

(13)

\begin{matrix} \frac{\partial L}{\partial α_{k}} = 0 \Rightarrow y_{k} = w^{T} ϕ (x_{k}) + b + e_{k} \end{matrix}

(14)

Substituting the expression for

w

into the constraint (9) and introducing a kernel function

K (x_{i}, x_{j}) = ϕ {(x_{i})}^{T} ϕ (x_{j})

, Formula (14) can be rewritten as,

\begin{matrix} y_{k} = \sum_{j = 1}^{n} α_{j} K (x_{j}, x_{k}) + b + e_{k}, k = 1, 2, \dots, n \end{matrix}

(15)

Combining Formulas (13)–(15), a system of linear equations represented in vectors is formed,

\begin{matrix} [\begin{matrix} 0 & 1_{n}^{T} \\ 1_{n} & K + γ^{- 1} I_{n} \end{matrix}] [\begin{matrix} b \\ α \end{matrix}] = [\begin{matrix} 0 \\ y \end{matrix}], \end{matrix}

(16)

where, the elements of kernel matrix

K

are computed by kernel function

K (x_{j}, x_{k}), j, k = 1, 2, \dots, n

,

y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}, α = {(α_{1}, α_{2}, \dots, α_{n})}^{T}, 1_{n} = {(1, 1, \dots, 1)}^{T}

. After solving Equation (16), the predicted output of

f (x)

for the LSSVR can be obtained,

\begin{matrix} f (x) = w^{T} ϕ (x) + b = \sum_{k = 1}^{n} α_{k} K (x, x_{k}) + b . \end{matrix}

(17)

The process of constructing the LSSVR has the following advantages: (1) converting quadratic programming problems into linear equations significantly improves computational efficiency; (2) introducing the regularization parameter

γ

to control model complexity effectively can prevent overfitting and enhance model generalization capability; (3) importantly, the number of selected hyperparameters is only two, while SVR has three with an additional insensitivity margin

ϵ

. Unfortunately, the use of

L_{2}

-norm of error in LSSVR deprives the model sparsity existing in SVR. The fundamental reason lies in its optimization formulation. The

L_{2}

-norm loss, coupled with equality constraints, forces all training samples to satisfy the Karush–Kuhn–Tucker conditions strictly, resulting in non-zero Lagrange multipliers for every data point. This mechanism lacks the threshold-induced sparsity inherent in the

ϵ

-insensitive loss of standard SVR, where only samples outside the

ϵ

-tube contribute support vectors, thereby eliminating the model’s capacity for sparse representation.

2.3. Comparison with Related SVR Variants and Robust Regression Methods

To further clarify the position of the proposed LI-SVR, this section provides a conceptual and structural comparison with SVR and LSSVR variants, as shown in Figure 1. From Equations (8) and (9), it can be observed that in the parameter optimization of LSSVR, the training errors of all training samples are considered, which gives rise to a loss of the model’s sparsity. Standard

ϵ

-SVR achieves sparsity by introducing an

ϵ

-insensitive tube, where samples lying inside the tube do not contribute to the regression model. However, the performance and sparsity of

ϵ

-SVR critically depend on the preselection of

ϵ

, which is typically determined heuristically or through an extensive grid search [3]. Adaptive

ϵ

strategies have been proposed to alleviate this issue, but they often introduce additional algorithmic complexity, data-dependent heuristics, or iterative update rules. In contrast, the proposed LI-SVR eliminates the explicit

ϵ

parameter and instead introduces a single global envelope variable

λ

, which is jointly optimized with the regression function. This formulation avoids manual tuning of

ϵ

while preserving sparsity through the KKT conditions.

ℓ_{1}

-SVR and least absolute deviation regression promote robustness by penalizing absolute errors, while quantile regression focuses on estimating conditional quantiles rather than central tendencies. Although these methods are effective for handling non-Gaussian noise or asymmetric error distributions, their objectives differ fundamentally from the proposed LI-SVR. Specifically, LI-SVR aims to control the maximum approximation error via an

ℓ_{\infty}

-type envelope, rather than minimizing cumulative absolute deviations or targeting specific quantiles. As a result, LI-SVR is particularly suitable for applications requiring bounded worst-case prediction errors. Classical robust regression methods, such as Huber regression and Chebyshev approximation, have long employed

ℓ_{\infty}

-norm or hybrid loss functions to enhance robustness against outliers. However, these approaches are typically formulated in linear or non-kernelized settings and do not explicitly exploit the primal–dual structure of support vector machines. Moreover, they do not directly address model sparsity in kernel-based regression. By contrast, the proposed LI-SVR integrates an

ℓ_{\infty}

-type constraint into the SVR/LSSVR framework, yielding a sparse kernel regression model with a compact set of support vectors.

Motivated by both

ϵ

-SVR and LSSVR, we introduce the infinite norm on approximation errors to address the aforementioned issues. In the proposed LI-SVR, sparsity emerges intrinsically from the optimization structure and KKT conditions, without requiring additional sparsity-inducing regularizers or post hoc pruning. This leads to a simpler and more interpretable sparse regression model.

3. SVR and LSSVR-Inspired Least $L_{\infty}$ -Norm Support Vector Regression

Based on the comparative analysis in Section 2, it is evident that SVR achieves sparsity at the cost of increased hyperparameter complexity, whereas LSSVR simplifies optimization but loses sparsity due to its

L_{2}

-norm error formulation. Section 3 builds directly upon this observation by reformulating the LSSVR objective using an

L_{\infty}

-norm error criterion, with the aim of restoring intrinsic sparsity while preserving convexity and limiting the number of tunable hyperparameters to two.

Actually, model sparsity is guaranteed by whether the absolute values of the Lagrange multipliers solved are sufficiently small. Given that LSSVR utilizes the

L_{2}

-norm of the error to assess modeling accuracy, it inherently lacks model sparsity. In contrast, SVR obtains sparsity by introducing

ϵ

, but it adds the complexity of optimizing an additional hyperparameter. To deal with these issues, we propose the least infinity (

L_{\infty}

)-norm Support Vector Regression (LI-SVR) method, as shown in Figure 2.

3.1. The $L_{\infty}$ -Norm on Approximation Error

The

L_{\infty}

-norm (also known as the maximum norm) is defined as the largest element of the absolute value in a vector. In contrast, the

L_{2}

-norm is a square root of the sum of squares in all elements, which calculates the length or distance of the vector in Euclidean space.

Assuming the obtained input–output data and the regression model to be solved are represented as in (1) and (2), respectively, where the measurement of

y_{k}

satisfies the following nonlinear system,

y_{k} = g (x_{k}), k = 1, 2, \dots, n .

(18)

According to statistical theory, there exists an arbitrary approximation of the measurement model g by the nonlinear regression model f described in Equation (2). As the approximation precision decreases, fewer SVs are needed in SVR; conversely, more SVs are required. Hence, for any given real continuous function g and

η > 0

, a regression model f can be found,

\sup_{x_{k} \in S} | f (x_{k}) - g (x_{k}) | < η, \forall k .

(19)

In traditional LSSVR, the following deviation

e_{k}, k = 1, 2, \dots, n

is defined,

\begin{matrix} e_{1} = f (x_{1}) - y_{1} = w^{T} ϕ (x_{1}) + b - y_{1} \\ e_{2} = f (x_{2}) - y_{2} = w^{T} ϕ (x_{2}) + b - y_{2} \\ ⋮ \\ e_{n} = f (x_{n}) - y_{n} = w^{T} ϕ (x_{n}) + b - y_{n} . \end{matrix}

(20)

and its optimization problem employs the

L_{2}

-norm of deviations, implying that all deviations in Formula (20) contribute to the solution of Lagrange multipliers, which is the primary reason for the loss of sparsity.

Now, we introduce

L_{\infty}

-norm on the approximation error to replace the

L_{2}

-norm and solve for the regression model f satisfying condition (19). Initially, we search for the maximum absolute deviation

λ

in Formula (20), defined as the

L_{\infty}

-norm of the approximation error. Next, the minimization of

λ

can be transformed into a

\min - \max

optimization problem to get intrinsic sparsity,

\begin{matrix} α & = \min \max_{x_{k} \in Z} | y_{k} - f (x_{k}) |, \forall k, \\ = \arg \min_{w, b} \max_{x_{k} \in Z} | y_{k} - w^{T} ϕ (x_{k}) - b | . \end{matrix}

(21)

where

Z

denotes the finite training set. For SVR, the interpretation of sparsity implies that contributions of input samples within the pipeline are disregarded by the regression model. Actually, the essentially sparse nature proposed in this paper borrows from the idea of SVR. That is to say, when the deviation

e_{k}

satisfies the condition of

- λ < e_{k} < λ

, the contribution of the corresponding sample

x_{k}

will be negligible in (21).

Lemma 1.

The

\min - \max

optimization problem (21) can be addressed through linear programming,

\begin{matrix} \min : λ, \\ s . t . \{\begin{matrix} y_{k} - w^{T} ϕ (x_{k}) - b ⩽ λ, \\ - y_{k} + w^{T} ϕ (x_{k}) + b ⩽ λ \\ λ ⩾ 0, k = 1, 2, \dots, n, \end{matrix} \end{matrix}

(22)

where

w

and b are the parameters to be determined, and λ represents the maximum approximation error.

Proof.

Define

λ

as follows,

λ = \max |y_{k} - w^{T} ϕ (x_{k}) - b|

(23)

and the following inequality can be directly derived,

|y_{k} - w^{T} ϕ (x_{k}) - b| ⩽ λ

(24)

By applying the absolute value operation, the system of inequalities (24) can be rewritten in the form of (22). This completes the proof of Lemma 1. □

The sparsity-promoting property of the

ℓ_{\infty}

-norm formulation can be intuitively understood from its min–max structure. As shown in (22), the optimization seeks to minimize a single scalar variable

λ

, which represents the maximum approximation error shared by all training samples. Unlike

ℓ_{2}

-based formulations that accumulate errors over all samples, the objective here is governed solely by the worst-case deviation. As a result, only those samples whose residuals attain the maximum bound

λ

actively constrain the solution, while samples with smaller errors remain strictly inactive. Consequently, the regression model is effectively determined by a limited subset of critical samples, which naturally leads to a sparse representation. Next, this lemma will be used to establish a sparse regression model in our approach.

3.2. Establishment of Sparse Kernel Regression Model

In fact, early studies [42,43] on the classification of machine learning methods already discussed how the sparsity of models could be enhanced by leveraging the concept of the

L_{\infty}

-norm. A good regression model f should be a comprehensive embodiment of both model sparsity and modeling accuracy, as exemplified by SVR and LSSVR. However, SVR’s optimization problem has the following two drawbacks compared to LSSVR: (1) with N training samples, the introduction of slack variables of

ξ_{k}^{+}, ξ_{k}^{-}

increases the number of inequality constraints by

2 N

; (2) it requires the selection of three hyperparameters, including the kernel parameter

σ

, regularization factor

γ

, and insensitivity margin

ϵ

, while LSSVR only has two of

σ, γ

. In light of this [42,43], while maintaining the same number of hyperparameters as LSSVR, we propose the least

L_{\infty}

-norm SVR(LI-SVR). Hence, drawing inspiration from LSSVR, we substitute the

L_{2}

-norm optimization with the

\min - \max

problem (22) and yield a new optimization,

\begin{matrix} \min : J (w, b, e) = \frac{1}{2} w^{T} w + \frac{γ}{2} λ^{2} \end{matrix}

(25)

\begin{matrix} s . t . \{\begin{matrix} y_{k} - w^{T} ϕ (x_{k}) - b ⩽ λ, \\ - y_{k} + w^{T} ϕ (x_{k}) + b ⩽ λ \\ λ > 0, k = 1, 2, \dots, n, \end{matrix} \end{matrix}

(26)

where the first term

\frac{1}{2} w^{T} w

in (25) controls the model structure, which is minimized to favor smoother model characteristics; the second term

λ^{2}

primarily focuses on improving model accuracy;

γ > 0

is a regularization factor, used to balance the weights between model structure and modeling accuracy. What should be emphasized is that in inequality constraint (26), we only consider

λ > 0

. This is because

λ

represents the maximum one among all absolute deviations. If

λ = 0

, then

|w^{T} ϕ (x_{k}) + b - y_{k}| = 0

, indicating that regression model

f (x)

can completely fit the actual outputs y, which has a high potential to trigger severe overfitting and result in extreme degradation of the model’s generalization performance. It should be emphasized that

λ

is an optimization variable rather than a hyperparameter. Unlike the

ϵ

parameter in SVR, which must be manually specified prior to training,

λ

is automatically determined during optimization as the maximum approximation error, as shown in Equation (29). Therefore, the proposed method involves only two hyperparameters: the regularization coefficient

γ

and the kernel width

σ

.

To solve for the weight vector

w

and the bias term b, we leverage the Lagrange function with Lagrange multipliers

α_{k}

and

β_{k}

to convert the constraints into an unconstrained optimization problem,

\begin{matrix} L (w, b, λ, α_{k}, β_{k}) = & \frac{1}{2} w^{T} w + \frac{γ}{2} λ^{2} + \sum_{k = 1}^{n} α_{k} (y_{k} - w^{T} ϕ (x_{k}) - b - λ) \\ + \sum_{k = 1}^{n} β_{k} (w^{T} ϕ (x_{k}) + b - y_{k} - λ) . \end{matrix}

(27)

To achieve the optimal solution, partial derivatives of the Lagrange function with regard to

w, b, λ, α_{k}, β_{k}

are computed using the Karush–Kuhn–Tucker condition and equated to zero, respectively,

\begin{matrix} \frac{\partial L}{\partial w} & = w - \sum_{k = 1}^{n} α_{k} ϕ (x_{k}) + \sum_{k = 1}^{n} β_{k} ϕ (x_{k}) = 0 \\ \Rightarrow w = \sum_{k = 1}^{n} (α_{k} - β_{k}) ϕ (x_{k}), \end{matrix}

here, the weight vector

w

can be viewed as a linear expansion of coefficients

α_{k} - β_{k}

in relation to the nonlinear feature mapping

ϕ (x)

.

\begin{matrix} \frac{\partial L}{\partial λ} & = λ γ - \sum_{k = 1}^{n} α_{k} - \sum_{k = 1}^{n} β_{k} = 0 \\ \Rightarrow λ = \frac{1}{γ} \sum_{k = 1}^{n} (α_{k} + β_{k}), \end{matrix}

(28)

From Equation (28), it can be observed that

λ

is automatically calculated via

α_{k}

and

β_{k}

, eliminating the need for prior determination.

\frac{\partial L}{\partial b} = \sum_{k = 1}^{n} (- α_{k}) + \sum_{k = 1}^{n} β_{k} = 0 \Rightarrow \sum_{k = 1}^{n} α_{k} = \sum_{k = 1}^{n} β_{k}

(29)

\begin{matrix} \frac{\partial L}{\partial α_{k}} & = y_{k} - (w^{T} ϕ (x) + b) - λ = 0 \\ \Rightarrow y_{k} = w^{T} ϕ (x_{k}) + b + λ \end{matrix}

(30)

\begin{matrix} \frac{\partial L}{\partial β_{k}} & = w^{T} ϕ (x) + b - y_{k} - λ = 0 \\ \Rightarrow y_{k} = w^{T} ϕ (x_{k}) + b - λ \\ β_{k} \geq 0, α_{k} \geq 0, λ > 0, k = 1, 2, \dots, n \end{matrix}

(31)

where Equations (30) and (31) will be employed for analyzing the sparsity properties and determining whether the k-th sample

x_{k}

qualifies as an SV.

Now, substituting Formula (28) into (27), we obtain,

\begin{matrix} L (α_{k}, β_{k}) = & - \frac{1}{2 γ} {(\sum_{k = 1}^{n} (α_{k} + β_{k}))}^{2} + \sum_{k = 1}^{n} (α_{k} - β_{k}) y_{k} \\ - \frac{1}{2} \sum_{j = 1}^{n} \sum_{k = 1}^{n} (α_{j} - β_{j}) (α_{k} - β_{k}) ϕ^{T} (x_{j}) ϕ (x_{k}) \end{matrix}

(32)

where the constraints of

β_{k} \geq 0

and

α_{k} \geq 0

in (32) must be met. Maximizing Formula (27) is equivalent to the maximization of (32). However, the maximization of (32) can be reformulated as its dual problem,

\begin{matrix} \min : & J (α, β) = \frac{1}{2 γ} {(\sum_{k = 1}^{n} (α_{k} + β_{k}))}^{2} + \sum_{k = 1}^{n} (- α_{k} + β_{k}) y_{k} \\ + \frac{1}{2} \sum_{j = 1}^{n} \sum_{k = 1}^{n} (α_{j} - β_{j}) (α_{k} - β_{k}) ϕ^{T} (x_{j}) ϕ (x_{k}) \\ s . t & \sum_{k = 1}^{N} (α_{k} - β_{k}) = 0, \\ α_{k} \geq 0, β_{k} \geq 0, k = 1, 2, \dots, n . \end{matrix}

(33)

Theorem 1.

Optimization problem (33) can be solved through quadratic programming, and the solution of

α_{k}

and

β_{k}

is globally optimal.

Proof.

We only need to transform optimization (33) into vector form, as seen in Equation (34), and the theorem is proven. □

\begin{matrix} \min : J (α, β) = \frac{1}{2} {(\begin{matrix} α \\ β \end{matrix})}^{T} (\begin{matrix} K + \frac{1}{γ} I & - K + \frac{1}{γ} I \\ - K + \frac{1}{γ} I & K + \frac{1}{γ} I \end{matrix}) (\begin{matrix} α \\ β \end{matrix}) \\ + {(\begin{matrix} - y \\ y \end{matrix})}^{T} (\begin{matrix} α \\ β \end{matrix}) \\ s . t . (Θ^{T} - Θ^{T}) (\begin{matrix} α \\ β \end{matrix}) = 0, \\ β \geq 0, α \geq 0 . \end{matrix}

(34)

Here,

\begin{matrix} α = {(α_{1}, α_{2}, \dots, α_{n})}^{T}, β = {(β_{1}, β_{2}, \dots, β_{n})}^{T}, \\ y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}, Z = {(0, 0, \dots, 0)}^{T} \\ K (j, k) = K (x_{j}, x_{k}) = ϕ^{T} (x_{j}) ϕ (x_{k}), \\ K = (\begin{matrix} K (1, 1) & K (1, 2) & \dots & K (1, n) \\ ⋮ & ⋮ & ⋮ \\ K (n, 1) & K (n, 2) & \dots & K (n, n) \end{matrix}), \end{matrix}

I

is an n-dimensional matrix filled with elements that are all equal to 1,

Θ

represents a column vector with all elements equal to 1, the entries of kernel matrix

K

are calculated using kernel function

K (x_{i}, x_{j}) = ϕ^{T} (x_{j}) ϕ (x_{k})

, which includes commonly used ones such as polynomial, radial basis, and sigmoid kernels, etc. It has been demonstrated that the radial basis kernel, also known as the Gaussian kernel, exhibits superior performance over other kernel functions in handling nonlinear regression problems. Therefore, the Gaussian kernel function with kernel width

σ

is adopted in this study,

\begin{matrix} K (x_{i}, x_{j}) = \exp (\frac{- ∥ x_{i} - x_{j} ∥^{2}}{2 σ^{2}}) . \end{matrix}

(35)

As the optimization (32) is a typical quadratic programming problem, the solution obtained is globally optimal, thus avoiding local optima. After solving the optimization problem of (32), we found that the magnitudes of most Lagrange multipliers of

α_{k}

and

β_{k}

are sufficiently small, even negligible. The maximum approximation error

λ

can be computed based on (28). Subsequently, we derive a bias term b from Equation (30),

\begin{matrix} b & = y_{k} - \sum_{j = 1}^{n} (α_{j} - β_{j}) K (x_{k}, x_{j}) - λ \\ = \frac{(y_{k} - \sum_{j = 1, x_{j} \in S V}^{n_{s v 1}} α_{j} K (x_{k}, x_{j}) - λ)}{n_{s v 1}} \end{matrix}

(36)

or (31),

\begin{matrix} b & = y_{k} - \sum_{j = 1}^{n} (α_{j} - β_{j}) K (x_{k}, x_{j}) + λ \\ = \frac{(y_{k} - \sum_{j = 1, x_{j} \in S V}^{n_{s v 2}} - β_{j} K (x_{k}, x_{j}) + λ)}{n_{s v 2}} \end{matrix}

(37)

Here,

n_{s v 1}

and

n_{s v 2}

represent the number of significant differences from zero for components

α_{k}

and

β_{k}

in

α

and

β

, respectively. Assuming that the total number of SVs (support vectors) is

n_{s v}

, and these SVs are represented as

S V_{j}, j = 1, 2, \dots, n_{s v}

, we can derive an essentially sparse regression model, namely LI-SVR, comparable to SVR and LSSVR,

\begin{matrix} f (x) = \sum_{j = 1}^{n_{s v}} (α_{j} - β_{j}) K (x, {S V}_{j}) + b . \end{matrix}

(38)

It is also revealed in Section 3.3 that

n_{s v}

is substantially smaller than n. In practice, the quadratic programming problem in Theorem 1 is solved using standard interior-point solvers. Numerical stability is ensured by kernel matrix regularization and normalization of input features. All experiments employ identical solver tolerances to ensure fair comparisons.

3.3. Support Vectors Analysis for Sparse LI-SVR

The number of SVs plays a crucial role in determining the model complexity of SVR. Traditionally, LSSVR lacks sparsity, meaning that all samples contribute to the model construction. This not only increases the structural complexity of the model, leading to potential overfitting issues, but also deteriorates its generalization performance. Therefore, the method proposed in this paper can essentially produce a sparse regression model after solving the optimization problem (33). The quality of sparsity depends on the number of SVs. In general, more SVs indicate a poorer sparsity, while fewer SVs imply a better one. Below, we will analyze how the SVs are generated in the proposed sparse LI-SVR.

Firstly, we define a deviation

{\tilde{e}}_{k}

between the actual output

y_{k}

and predicted values

{\tilde{y}}_{k}

as follows,

\begin{matrix} {\tilde{e}}_{k} & = y_{k} - {\tilde{y}}_{k} \\ = y_{k} - \sum_{j = 1}^{n} (α_{j} - β_{j}) K (x_{k}, x_{j}) - b . \end{matrix}

(39)

where

k = 1, 2, \dots, n

. Upon completing the solution to the convex quadratic programming problem (33), the maximum approximation error

λ

can be computed from Equation (28). According to conditions (23) and (30), we further obtain,

\begin{matrix} \{\begin{matrix} λ = \max {{\tilde{e}}_{k}}, \\ k = 1, 2, \dots, n, λ > 0 . \end{matrix} \end{matrix}

(40)

It is worth noting that the difference between

{\tilde{e}}_{k}

and

λ

is that Equation (39) includes both

{\tilde{e}}_{k} < λ

and

{\tilde{e}}_{k} = λ

, whereas Equation (40) contains only those subject to condition of

{\tilde{e}}_{k} = λ

. Considering that we derive Equation (40) by computing the partial derivative of the Lagrangian function with respect to

α_{j}

, the sparse solution of

α_{j}

is dominated by Equation (40). Logically, the k-th sample

x_{k}

is selected as an SV only when

λ = {\tilde{e}}_{k}

strictly, and the contribution of the corresponding

α_{k}

for sparse LI-SVR should not be overlooked. That is to say, only a small fraction of all deviations of (39) satisfies the condition of

λ = {\tilde{e}}_{k}

. Similarly, we can conduct an analysis on the sparsity of the Lagrange multiplier

β_{k}

. From, Equations (23) and (30), we have,

\begin{matrix} \{\begin{matrix} λ = \max {- {\tilde{e}}_{k}, k = 1, 2, \dots, n} \\ k = 1, 2, \dots, n, λ > 0 . \end{matrix} \end{matrix}

(41)

It should be noted that from Equation (30),

λ

denotes the maximum value of all negative deviations at this point. Equation (41) allows us to ascertain an index, k, of all SVs in which its negative deviations

- {\tilde{e}}_{k}

equal exactly

λ

. It is evident that the numbers

n_{s v 1}

and

n_{s v 2}

of the SVs are markedly smaller than the total sample size n. Finally, the total number of SVs used in constructing the sparse LI-SVR is computed as

n_{s v}

(namely,

n_{s v 1} + n_{s v 2})

, where the size of coefficient

| α_{k} - β_{k} |

corresponding to these SVs is not equal to zero. The essence of the obtained sparse LI-SVR can be reflected through Theorem 2.

Theorem 2.

Among the Lagrange multipliers

α = (α_{1}, α_{2}, \dots, α_{n})

and

β = (β_{1}, β_{2}, \dots, β_{n})

, at most, one of

(α_{k}, β_{k})

is non-zero, and one of the following three conditions is always satisfied for

k = 1, 2, \dots, n

and

λ > 0

:

\begin{matrix} (a) : α_{k} > 0, β_{k} = 0 \\ (b) : α_{k} = 0, β_{k} > 0 \\ (c) : α_{k} = 0, β_{k} = 0 \end{matrix}

Proof.

According to the original constraints (26), it is apparent that any one in the Equations (42)–(44) remains valid,

\begin{matrix} y_{k} - w^{T} ϕ (x_{k}) - b = λ, \end{matrix}

(42)

\begin{matrix} - λ < y_{k} - w^{T} ϕ (x_{k}) - b < λ \end{matrix}

(43)

\begin{matrix} - y_{k} + w^{T} ϕ (x_{k}) + b = λ . \end{matrix}

(44)

From the Karush–Kuhn–Tucker (KKT) conditions, we have,

\begin{matrix} α_{k} (y_{k} - w^{T} ϕ (x_{k}) - b - λ) = 0 \end{matrix}

(45)

\begin{matrix} β_{k} (- y_{k} + w^{T} ϕ (x_{k}) + b - λ) = 0 . \end{matrix}

(46)

where

α_{k} \geq 0

,

β_{k} \geq 0

and

λ > 0

.

Equation (45) is evidently true when

α_{k} = 0

, whereas Equation (42) must hold when

α_{k} > 0

. Substituting it into Equation (46) results in

β_{k} \cdot (- 2 λ) = 0

, implying

β_{k} = 0

. Condition

(a)

subjected to both

α_{k} > 0

and

β_{k} = 0

in Theorem 2 is proven. In addition, Equation (46) also holds true when

β_{k} = 0

. If

β_{k} > 0

, then from Equations (44) and (45), it follows that

α_{k} \cdot (- 2 λ) = 0

, which leads to

α_{k} = 0

. This further confirms the condition

(b)

of Theorem 2. Similarly, when Equation (43) holds, both

α_{k} = 0

and

β_{k} = 0

must be met from Equations (45) and (46). We derive the third condition

(c)

of Theorem 2 again.

However, it is impossible for both

α_{k} > 0

and

β_{k} > 0

to occur simultaneously. If

α_{k} > 0

and

β_{k} > 0

are both met, then we have,

\begin{matrix} λ = y_{k} - w^{T} ϕ (x_{k}) - b \end{matrix}

(47)

\begin{matrix} λ = - y_{k} + w^{T} ϕ (x_{k}) + b, \end{matrix}

(48)

and it is inferred that

λ = 0

is contradictory to the known fact that

λ > 0

. This concludes the proof of Theorem 2. □

Although

λ

is influenced by the largest residual, the regularization term jointly constrains the model complexity and prevents overreaction to individual samples. This trade-off limits the sensitivity of

λ

to moderate outliers in practice, as confirmed by the experimental results.

In conclusion, we can clearly see that Theorem 2 further elaborates on how to obtain sparse solutions using the proposed method. Additionally, we also observe that the number of samples satisfying condition

(c)

in Theorem 2 is much higher than the one corresponding to the other two conditions

(a)

and

(b)

. Therefore, the proposed sparse LI-SVR method has a fundamentally evolutionary development over the traditional LSSVR presented by Johan Suykens and Joos Vandewalle in enhancing model sparsity and generalization ability.

Further, the proposed LI-SVR formulation explicitly minimizes a regularized upper bound of the prediction error, which inherently limits the influence of noise on the regression function. As further demonstrated in the experimental section, the model exhibits stable performance across different noise realizations and noise distributions, indicating robustness against stochastic disturbances.

3.4. Relationship Between the Global Variable $λ$ and the $ϵ$ -Insensitive Loss

In classical

ϵ

-SVR, sparsity is induced by the

ϵ

-insensitive loss, where prediction errors within a predefined tolerance band do not contribute to the objective function. In contrast, the proposed LI-SVR replaces the fixed

ϵ

-tube with a single global envelope variable

λ

, which is jointly optimized with the regression function. This subsection provides a formal discussion on the relationship between

λ

and

ϵ

.

For a fixed regression function

f (x) = w^{T} ϕ (x) + b

, the

ϵ

-SVR constraint enforces

| y_{k} - f (x_{k}) | \leq ϵ + ξ_{k},

(49)

where

ϵ

is user-specified and

ξ_{k}

are slack variables. In LI-SVR, the corresponding constraint can be written as

| y_{k} - f (x_{k}) | \leq λ,

(50)

where

λ

is an optimization variable shared by all samples.

It can be observed that, for any feasible solution of

ϵ

-SVR, choosing

λ = \max_{k} | y_{k} - f (x_{k}) |

yields a feasible solution for LI-SVR. Conversely, for a given LI-SVR solution

(f, λ)

, setting

ϵ = λ

leads to an

ϵ

-SVR feasible set that contains the LI-SVR solution as a special case. Therefore, LI-SVR can be interpreted as optimizing an upper bound of the

ϵ

-insensitive loss, where the tolerance parameter is no longer fixed a priori but is determined adaptively from the data.

More formally, the LI-SVR objective can be viewed as minimizing a regularized upper bound of the maximum absolute deviation, while

ϵ

-SVR minimizes the sum of deviations exceeding a fixed threshold. When the optimal solution of

ϵ

-SVR satisfies that all active errors lie exactly on the boundary of the

ϵ

-tube, the two formulations coincide with

λ = ϵ

. In general cases, LI-SVR yields a conservative envelope that guarantees bounded worst-case errors, whereas

ϵ

-SVR allows a trade-off between sparsity and tolerance through a predefined margin.

This analysis clarifies that

λ

should not be regarded as a direct replacement of

ϵ

, but rather as an adaptive envelope parameter that implicitly controls the insensitivity region through optimization.

In addition, we emphasize that interpreting

λ

as an “adaptive

ϵ

” is not intended to imply a strict theoretical equivalence. Instead, this interpretation arises naturally from the optimality conditions of the proposed formulation.

From the Karush–Kuhn–Tucker (KKT) conditions of the LI-SVR optimization problem, it follows that only samples satisfying,

| y_{k} - f (x_{k}) | = λ

(51)

are associated with nonzero dual variables and thus act as support vectors. All remaining samples strictly satisfy

| y_{k} - f (x_{k}) | < λ

(52)

and are inactive at the optimum. This property closely parallels the role of the

ϵ

-insensitive region in classical

ϵ

-SVR, where samples inside the

ϵ

-tube do not contribute to the solution.

However, a key distinction is that

λ

is not predefined but is jointly optimized together with the regression function. As a result, the effective insensitivity region is determined adaptively by the data distribution and regularization trade-off, rather than by user specification. From an optimization viewpoint,

λ

represents the smallest envelope that simultaneously satisfies all constraints while minimizing a regularized worst-case error.

4. Experimental Analysis

In comparison with LSSVR and SVR methods, the regression prediction using our approach has the following three significant characteristics: (1) for LSSVR, the proposed method not only achieves inherent sparsity but also outperforms SVR; (2) while SVR requires the optimization of three hyperparameters, the proposed method only optimizes two hyperparameters; (3) by introducing the infinite norm of error, the proposed method adaptively computes the maximum error, which is somewhat similar to the insensitive zone in SVR. Based on this, the following three metrics are introduced to demonstrate the effectiveness of the proposed method: the percentage of support vectors (

S V s %

), root mean square error (

R M S E

). The first metric is used to evaluate the complexity of the model structure, while the latter one is used to assess the accuracy of the model. Clearly, these metrics are conflicting, and the evaluation of regression prediction should strike a balance between them.

S V s %

,

R M S E

and

M A E

(maximum absolute error) are defined as, respectively,

\begin{matrix} S V s % = \frac{N_{S V}}{N} \times 100 % \end{matrix}

(53)

\begin{matrix} R M S E = \sqrt{\frac{1}{N} \sum_{j = 1}^{N} {(y_{k} - \hat{y_{k}})}^{2}}, \end{matrix}

(54)

here,

N_{S V}

denotes the total number of SVs, which is calculated by applying the conditions of

∣ α_{k} - β_{k} ∣ > 1 E - 8

, N represents the total number of training samples,

y_{k}

and

\hat{y_{k}}

are the actual and predicted output, respectively. MAE is reported to reflect the worst-case performance, which is particularly relevant for the proposed

L_{\infty}

-norm formulation. To demonstrate the rationality and superiority of the proposed method, we compare its performance with that of classical SVR and LSSVR on several benchmark datasets, including static nonlinear functions and linear/nonlinear dynamic systems. All experimental simulations were conducted in MATLAB 2022b, with the hyperparameters of SVR and LSSVR optimized using the built-in MATLAB packages and the LS-SVMlab Toolbox, respectively. Hyperparameter selection is not the primary focus of this study. In the experiments, the regularization parameter

γ

and kernel width

σ

are selected using grid search with cross-validation to ensure fair and stable performance across different datasets.

4.1. Computational Complexity and Scalability

The proposed LI-SVR is formulated as a convex quadratic programming problem with

2 n

inequality constraints, which is structurally simpler than the standard

ϵ

-SVR formulation involving

4 n

constraints. In terms of computational complexity, standard SVR requires solving a quadratic programming (QP) problem with

2 N

inequality constraints, resulting in a worst-case training complexity between

O (N^{2})

and

O (N^{3})

. LSSVR reduces this burden by solving a linear system of size N, leading to a complexity of

O (N^{3})

. The proposed LI-SVR also involves a convex QP problem, but without slack variables and with only

2 N

non-negativity constraints. In practice, the effective number of active constraints is significantly smaller due to sparsity, yielding training times comparable to or lower than standard SVR. In addition, LI-SVR requires only two hyperparameters, namely the kernel width and the regularization parameter, whereas classical

ϵ

-SVR introduces an additional

ϵ

-insensitive parameter. This reduction in hyperparameter dimensionality significantly alleviates the computational burden associated with hyperparameter optimization, especially when grid search or cross-validation is employed.

In terms of convergence behavior, LI-SVR can be efficiently solved using standard quadratic programming solvers and exhibits stable convergence due to its convex objective and reduced constraint set. Since the kernel matrix structure is identical to that of LSSVR and SVR, the memory footprint scales quadratically with the number of training samples, which is common to kernel-based methods.

Regarding scalability, LI-SVR inherits the same limitations and advantages as kernel SVR methods. While the current work focuses on moderate-scale datasets, the proposed formulation is compatible with existing large-scale extensions, such as decomposition methods or approximate kernel techniques, which can be incorporated in future work.

4.2. Static Nonlinear Functions

Initially, we evaluate the performance of our method by applying it to a simple regression task of the sinc function [44], which is defined as:

y = s i n c (x) = \frac{s i n x}{x} + ν, x \in U [- 4 π, 4 π]

(55)

where U denotes a random variable with a uniform distribution in the interval

[- 4 π, 4 π]

, and v is Gaussian noise with a mean of 0 and a variance of

{0.2}^{2}

. To mitigate potential bias in the regression prediction experiments, the mean value derived from ten independent runs for each dataset is used as the evaluation metric. Then, 500 training samples are generated from Equation (55) to establish the regression prediction model using the LI-SVR method. Likewise, 10 independent test datasets are generated from Equation (55), with each test dataset consisting of 500 samples. The results obtained for a classical SVR, LSSVR and LI-SVR (our approach) are listed in Table 1. For training RMSE, our method performs slightly worse than both SVR and LSSVR, while the testing RMSE is between the two. However, the sparsity of our method significantly outperforms them. The

S V s %

in our method is 2.60%, indicating that only 13 samples (as shown in Figure 3) out of 500 training samples were used to construct the regression prediction model, compared to 445 and 500 samples required by SVR and LSSVR, respectively. Figure 4 gives the indices of these SVs obtained by our method, along with the corresponding magnitudes. Clearly, while maintaining modeling accuracy, the proposed LI-SVR exhibits superior sparsity compared to the classical SVR and LSSVR in terms of model sparsity.

4.3. Linear Dynamic System

Consider a linear dynamic system represented by the following transfer function,

G (s) = \frac{C (s)}{R (s)} = \frac{20}{0.05 s^{2} + 3 s + 20}

(56)

with a sampling time

T_{s} = 0.01 s

, a total of

N = 500

samples, and input/output delays both set to 2. The input–output pairs of

{X (k), Y (k)}

for constructing LI-SVR are given as follows,

\begin{matrix} X (k) = {y (k - 1), y (k - 2), u (k - 1), u (k - 2)} \\ Y (k) = y (k), k = 3, 4, \dots, 500 \end{matrix}

(57)

here

u (k)

is a random signal defined as

u (k) = 2.0 \times r a n d

in our MATLAB simulation. Figure 5 shows the training output of LI-SVR with hyperparameter set

(σ, γ) = (0.8, 1000)

, where the blue solid circles represent the SVs. The corresponding magnitude of the LI-SVR model coefficients for these SVs is given in Figure 6, accounting for 10.08% of the total sample size, compared to 66.47% for SVR and 100% for LSSVR. The detailed training/testing RMSEs are presented in Table 2. Although LSSVR achieves the best modeling accuracy, its model structure is significantly more complex, reaching a magnitude of

1 \times 10^{7}

as indicated by the y-axis in Figure 7. Further, test data were generated based on Equation (57), and LI-SVR achieved favorable predicted outputs, as shown in Figure 8.

4.4. Nonlinear Dynamic System

We consider a nonlinear dynamic system [44],

\begin{matrix} y (k + 1) = 0.5 y (k) + 0.5 \frac{y^{2} (k - 1)}{1 + y^{2} (k - 1)} - 0.5 y (k - 1) u (k) + u (k) \end{matrix}

(58)

with a sampling time

T_{s} = 0.01 s

, a total of

N = 200

samples. The input–output pairs of

{X (k), Y (k)}

for constructing LI-SVR are the same as (57). When hyperparameter set

(σ, γ)

are chosen as

(0.2, 1000)

, the predicted output and SV are shown in Figure 9, where the indices associated with the SVs, along with the corresponding model coefficients, are illustrated in Figure 10. The predictive outputs derived from the test data, as governed by Equation (58), are presented in Figure 11. In Table 3, the modeling precision of the LI-SVR is observed to be intermediate between SVR and LSSVR, yet it demonstrates the most favorable sparsity. In particular, the proposed method constructs a regression model utilizing only 13 SVs out of 200 data points.

5. Conclusions

Although LSSVR greatly improves training efficiency by converting inequality into equality constraints, it also forfeits the crucial model sparsity of SVR. To deal with the fundamental flaw of LSSVR, we propose and thoroughly analyze another least infinity-norm (

L_{\infty}

-norm) version for LSSVR and SVR. Essentially, the proposed method exhibits excellent sparsity compared to traditional LSSVR, by adopting the

L_{\infty}

-norm and introducing new constraints to replace the

L_{2}

-norm and its constraints used in the original LSSVR optimization problem. Without the additional insensitivity margin

ϵ

and slack variables

ξ_{k}

seen in SVR, the number of hyperparameters in our approach is the same as LSSVR’s. Furthermore, another strength of using the idea of least

L_{\infty}

-norm is that it not only generates truly sparse properties in the regression model, but also automatically obtains the maximum approximation error

λ

reflecting the model accuracy.

Despite its advantages, the proposed method may face scalability challenges on extremely large datasets due to kernel matrix construction. In addition, its performance depends on kernel selection, and the

L_{\infty}

-norm criterion may be overly conservative in noise-dominated scenarios. Future work will focus on extending the proposed framework using low-rank kernel approximations or online learning strategies to improve scalability in high-dimensional or large-scale settings.

Author Contributions

Conceptualization, X.L., D.L. and C.Z.; methodology, X.L. and D.L.; validation, X.L., D.L. and C.Z.; formal analysis, C.Z.; writing—original draft preparation, X.L. and D.L.; writing—review and editing, X.L. and C.Z.; supervision, C.Z.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61966006), the Key Laboratory Project of the Guizhou Provincial Department of Education (QJJ[2023]029), and the Zunyi Science and Technology Innovation Team Project (KCTD065), Moutai Institute Joint Science and Technology Research and Development Project (ZSKHHZ[2024] No. 384, ZSKHHZ[2024] No. 385, ZSKHHZ[2023] No. 123), Maotai College Academic New Seed Cultivation and Free Exploration Innovation Special Funding Project (myxm202304) and training program of high-level innovative talents of Moutai institute (mygccrc[2024]011, mygc[2024]012).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

$x_{k} \in R^{d}$	Input vector of the k-th training sample
$y_{k} \in R$	Output (target) value of the k-th training sample
n	Number of training samples
d	Dimensionality of the input space
$ϕ (\cdot)$	Nonlinear feature mapping induced by the kernel function
$D = {(x_{k}, y_{k})}_{k = 1}^{n}$	Training dataset
$w$	Weight vector in the feature space
b	Bias term of the regression model
$f (x) = w^{T} ϕ (x) + b$	Regression function
$e_{k}$	Approximation error of the k-th sample
$λ$	Maximum approximation error (optimization variable)
$γ > 0$	Regularization coefficient controlling model complexity
$K (x_{i}, x_{j})$	Kernel function evaluating inner products in feature space
$σ$	Kernel width parameter of the RBF kernel
${∥ \cdot ∥}_{2}$	Euclidean ( $ℓ_{2}$ ) norm
${∥ \cdot ∥}_{\infty}$	Infinity ( $ℓ_{\infty}$ ) norm, representing the maximum absolute value
$α_{k}$	Lagrange multiplier associated with the upper inequality constraint
$β_{k}$	Lagrange multiplier associated with the lower inequality constraint
$α$	Vector of Lagrange multipliers ${α_{k}}_{k = 1}^{n}$
$β$	Vector of Lagrange multipliers ${β_{k}}_{k = 1}^{n}$
$η$	Dual variable associated with non-negativity constraints
$J (\cdot)$	Objective function of the optimization problem
SV	Support vector (sample with nonzero dual variables)
RMSE	Root mean square error used for performance evaluation
MAE	Maximum approximation error
LSSVR	Least Squares Support Vector Regression
SVR	Standard Support Vector Regression
LI-SVR	Least $ℓ_{\infty}$ -norm regularized SVR

References

Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
Zhao, Y.P.; Huang, G.; Hu, Q.K.; Li, B. An improved weighted one class support vector machine for turboshaft engine fault detection. Eng. Appl. Artif. Intell. 2020, 94, 103796. [Google Scholar] [CrossRef]
Kandukuri, S.T.; Senanyaka, J.S.L.; Huynh, V.K.; Robbersmyr, K.G. A two-stage fault detection and classification scheme for electrical pitch drives in offshore wind farms using support vector machine. IEEE Trans. Ind. Appl. 2019, 55, 5109–5118. [Google Scholar] [CrossRef]
Alamsabi, M.; Tchuindjang, M.; Brohi, S. Embedding-Based Detection of Indirect Prompt Injection Attacks in Large Language Models Using Semantic Context Analysis. Algorithms 2026, 19, 92. [Google Scholar] [CrossRef]
Ahmed, A.; Bosnić, Z. Machine learning for enabling high-data-rate secure random communication: SVM as the optimal choice over others. Mathematics 2025, 13, 3590. [Google Scholar] [CrossRef]
Colgan, M.S.; Baldeck, C.A.; Féret, J.B.; Asner, G.P. Mapping savanna tree species at ecosystem scales using support vector machine classification and BRDF correction on airborne hyperspectral and LiDAR data. Remote Sens. 2012, 4, 3462–3480. [Google Scholar] [CrossRef]
Chakrabarty, A.; Dinh, V.; Corless, M.J.; Rundell, A.E.; Zak, S.H.; Buzzard, G.T. Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences. IEEE Trans. Autom. Control 2016, 62, 135–148. [Google Scholar] [CrossRef]
Kirar, J.S.; Agrawal, R.K. Composite kernel support vector machine based performance enhancement of brain computer interface in conjunction with spatial filter. Biomed. Signal Process. Control 2017, 33, 151–160. [Google Scholar] [CrossRef]
Jiang, W.; Siddiqui, S. Hyper-parameter optimization for support vector machines using stochastic gradient descent and dual coordinate descent. EURO J. Comput. Optim. 2020, 8, 85–101. [Google Scholar] [CrossRef]
Azadeh, A.; Saberi, M.; Kazem, A.; Ebrahimipour, V.; Nourmohammadzadeh, A.; Saberi, Z. A flexible algorithm for fault diagnosis in a centrifugal pump with corrupted data and noise based on ANN and support vector machine with hyperparameters optimization. Appl. Soft Comput. 2013, 13, 1478–1485. [Google Scholar] [CrossRef]
Shakya, D.; Deshpe, V.; Safari, M.J.S.; Agarwal, M. Performance evaluation of machine learning algorithms for the prediction of particle Froude number (F_rn) using hyper-parameter optimizations techniques. Expert Syst. Appl. 2024, 256, 124960. [Google Scholar] [CrossRef]
Xu, H.; Chen, C.; Zheng, H.; Luo, G.; Yang, L.; Wang, W.; Wu, S.; Ding, J. AGA-SVR-based selection of feature subsets and optimization of parameter in regional soil salinization monitoring. Int. J. Remote Sens. 2020, 41, 4470–4495. [Google Scholar] [CrossRef]
Ramadevi, P.; Das, R. An extensive analysis of machine learning techniques with hyper-parameter tuning by Bayesian optimized SVM kernel for the detection of human lung disease. IEEE Access 2024, 12, 97752–97770. [Google Scholar] [CrossRef]
Ji, Y.; Xu, K.; Zeng, P.; Zhang, W. GA-SVR algorithm for improving forest above ground biomass estimation using SAR data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6585–6595. [Google Scholar] [CrossRef]
Li, Y.; Liu, G.; Jiao, L.; Marturi, N.; Shang, R. Hyper-parameter optimization using MARS surrogate for machine-learning algorithms. IEEE Trans. Emerg. Top. Comput. Intell. 2019, 3, 287–297. [Google Scholar] [CrossRef]
Zhu, F.; Gao, J.; Xu, C.; Yang, J.; Tao, D. On selecting effective patterns for fast support vector regression training. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3610–3622. [Google Scholar] [CrossRef]
Zhao, Y.; Zhang, W.; Liu, X. Grid search with a weighted error function: Hyper-parameter optimization for financial time series forecasting. Appl. Soft Comput. 2024, 154, 111362. [Google Scholar] [CrossRef]
Mantovani, R.; Rossi, A.; Vanschoren, J.; Bischl, B.; Carvalho, A. Effectiveness of random search in SVM hyper-parameter tuning. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015. [Google Scholar]
Nugroho, A.; Suhartanto, H. Hyper-parameter tuning based on random search for DenseNet optimization. In Proceedings of the 2020 7th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), Semarang, Indonesia, 24–25 September 2020. [Google Scholar]
Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
He, H.; Wang, K.; Jiang, Y.; Pei, H. Quadratic hyper-surface kernel-free large margin distribution machine-based regression and its least-square form. Mach. Learn. Sci. Technol. 2024, 5, 025024. [Google Scholar] [CrossRef]
Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
Lahmiri, S.; Tadj, C.; Gargour, C.; Bekiros, S. Optimal tuning of support vector machines and k-NN algorithm by using Bayesian optimization for newborn cry signal diagnosis based on audio signal processing features. Chaos Solitons Fractals 2023, 167, 112972. [Google Scholar] [CrossRef]
Wicaksono, A.S.; Supianto, A.A. Hyper parameter optimization using genetic algorithm on machine learning methods for online news popularity prediction. Int. J. Adv. Comput. Sci. Appl. 2018, 12, 263–267. [Google Scholar] [CrossRef]
Korovkinas, K.; Danenas, P.; Garsva, G. Support vector machine parameter tuning based on particle swarm optimization metaheuristic. Nonlinear Anal. Model. Control 2020, 25, 266–281. [Google Scholar] [CrossRef]
Hsia, J.; Lin, C. Parameter selection for linear support vector regression. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5639–5644. [Google Scholar] [CrossRef]
Suykens, J.A.K.; Vewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
Suykens, J.A.K.; Lukas, L.; Van Dooren, P.; De Moor, B.; Vewalle, J. Least squares support vector machine classifiers: A large scale algorithm. In Proceedings of the European Conference on Circuit Theory and Design (ECCTD), Stresa, Italy, 29 August–2 September 1999. [Google Scholar]
Yan, H.; Qi, Y.; Ye, Q.L.; Yu, D.Y. Robust least squares twin support vector regression with adaptive FOA and PSO for short-term traffic flow prediction. IEEE Trans. Intell. Transp. Syst. 2021, 23, 14542–14555. [Google Scholar] [CrossRef]
Suykens, J.A.K.; Vewalle, J.; De Moor, B.L.R. Optimal control by least squares support vector machines. Neural Netw. 2001, 14, 23–35. [Google Scholar] [CrossRef]
Liu, X.; Dong, X.G.; Zhang, L.; Chen, J.; Wang, C.J. Least squares support vector regression for complex censored data. Artif. Intell. Med. 2023, 136, 102497. [Google Scholar] [CrossRef]
Goethals, I.; Pelckmans, K.; Suykens, J.A.K.; De Moor, B. Subspace identification of Hammerstein systems using least squares support vector machines. IEEE Trans. Autom. Control 2005, 50, 1509–1519. [Google Scholar] [CrossRef]
Suykens, J.A.K.; De Brabanter, J.; Lukas, L.; Vewalle, J. Weighted least squares support vector machines: Robustness and sparse approximation. Neurocomputing 2002, 48, 85–105. [Google Scholar] [CrossRef]
Suykens, J.A.K.; De Brabanter, J.; Lukas, L.; Vewalle, J. Benchmarking least squares support vector machine classifiers. Mach. Learn. 2004, 54, 5–32. [Google Scholar] [CrossRef]
Hu, L.; Yi, G.X.; He, C. A sparse algorithm for adaptive pruning least square support vector regression machine based on global representative point ranking. J. Syst. Eng. Electron. 2021, 32, 151–162. [Google Scholar] [CrossRef]
Yang, L.X.; Yang, S.Y.; Zhang, R.; Jin, H.H. Sparse least square support vector machine via coupled compressive pruning. Neurocomputing 2014, 131, 77–86. [Google Scholar] [CrossRef]
Hong, X.; Mitchell, R.; Fatta, G.D. Simplex basis function based sparse least squares support vector regression. Neurocomputing 2019, 330, 394–402. [Google Scholar] [CrossRef]
Mall, R.; Suykens, J.A.K. Very sparse LSSVM reductions for large-scale data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 1086–1097. [Google Scholar] [CrossRef]
Ma, Y.; Liang, X.; Sheng, G.; Kwok, J.T.; Wang, M.; Li, G. Noniterative sparse LS-SVM based on globally representative point selection. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 788–798. [Google Scholar] [CrossRef]
Chen, D.-R.; Wu, Q.; Ying, Y.; Zhou, D.-X. Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res. 2004, 5, 1143–1175. [Google Scholar]
Ting, K.; Zhang, L.; Ge, X.; Lv, H.; Li, M. A robust least squares support vector machine based on L^∞-norm. Neural Process. Lett. 2020, 52, 2371–2397. [Google Scholar] [CrossRef]
Liu, X.; Yan, G.; Zhang, F.; Zeng, C.; Tian, P. Linear Programming-Based Sparse Kernel Regression with L₁-Norm Minimization for Nonlinear System Modeling. Processes 2024, 12, 2358. [Google Scholar] [CrossRef]

Figure 1. Concise structural diagram for constructing SVR and LSSVR.

Figure 2. The concise structural framework of the proposed method, illustrating the logical integration of sparsity and parameter efficiency.

Figure 3. A test output of the proposed LI-SVR.

Figure 4. The magnitude of coefficients and the indices of SVs obtained using our approach.

Figure 5. A test output of the proposed LI-SVR.

Figure 6. The magnitude of coefficients and the indices of SVs obtained using our approach.

Figure 7. The magnitude of coefficients obtained using LSSVR.

Figure 8. A test output of the proposed LI-SVR.

Figure 9. The training output and SV of the proposed LI-SVR.

Figure 10. The magnitude of coefficients and the indices of SVs obtained using our approach.

Figure 11. A test output of the proposed LI-SVR.

Table 1. The proposed LI-SVR is compared with the classical SVR and LSSVR in terms of training/testing RMSE and

S V s %