Relaxation Subgradient Algorithms with Machine Learning Procedures

Abstract: In the modern digital economy, optimal decision support systems, as well as machine learning systems, are becoming an integral part of production processes. Artificial neural network training, as well as other engineering problems, generates optimization problems of such high dimension that they are difficult to solve with traditional gradient or conjugate gradient methods. Relaxation subgradient minimization methods (RSMMs) construct a descent direction that forms an obtuse angle with all subgradients of the current minimum neighborhood, which reduces to the problem of solving systems of inequalities. Having formalized the model and taking into account the specific features of subgradient sets, we reduced the problem of solving a system of inequalities to an approximation problem and obtained an efficient, rapidly converging iterative learning algorithm for finding the direction of descent, conceptually similar to the iterative least squares method. The new algorithm is theoretically substantiated, and an estimate of its convergence rate is obtained depending on the parameters of the subgradient set. On this basis, we developed and substantiated a new RSMM, which has the properties of the conjugate gradient method on quadratic functions. We developed a practically realizable version of the minimization algorithm that uses a rough one-dimensional search. A computational experiment on complex functions in a space of high dimension confirms the effectiveness of the proposed algorithm. In problems of training neural network models, where it is required to remove insignificant variables or neurons using methods such as the Tibshirani LASSO, our new algorithm outperforms known methods.

MSC: 68T20


Introduction
In this study, which is an extension of previous work [1], a problem of minimizing a convex, not necessarily differentiable function f(x), x ∈ R^n (where R^n is a finite-dimensional Euclidean space) is considered. Such problems are quite common in the field of machine learning (ML), where optimization methods, in particular gradient descent, are widely used to minimize the loss function during the training stage. In the era of the digital economy, such functions arise in many engineering applications. First of all, training and regularizing artificial neural networks of a simple structure (e.g., radial or sigmoidal) may lead to loss functions in a high-dimensional space, which are often non-smooth. When working with more complex networks, such functions can also be non-convex.
While a number of efficient machine learning tools exist to learn smooth functions with high accuracy from a finite data sample, the accuracy of these approaches becomes less satisfactory for nonsmooth objective functions [2]. In a machine learning context, it is quite common to have an objective function with a non-smooth penalty term, as in the Lasso [3] or Elastic Net [4] linear regression models. Common loss functions, such as the hinge loss [5] for binary classification, or more advanced loss functions, such as the one arising in classification with a reject option [6], are also nonsmooth, as are some widely used activation functions (e.g., ReLU) [7] in the field of deep learning. Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex. However, the absence of differentiability creates serious theoretical difficulties on different levels (optimality conditions, definition and computation of search directions, etc.) [8].
Modern literature offers two main approaches to building nonsmooth optimization methods. The first is based on creating smooth approximations for nonsmooth functions [9][10][11][12][13]. On this basis, various methods intended for solving convex optimization problems and problems of composite and stochastic composite optimization [9][10][11][14] were theoretically substantiated.
The second approach is based on subgradient methods that have their origins in the works of N. Z. Shor [15] and B.T. Polyak [16], the results of which can be found in [17]. Initially, relaxation subgradient minimization methods (RSMMs) were considered in [18][19][20]. They were later developed into a number of effective approaches such as the subgradient method with space dilation in the subgradient direction [21,22] that involve relaxation by distance to the extremum [17,23,24]. The idea of space dilation is to change the metric of the space at each iteration with a linear transformation and to use the direction opposite to that of the subgradient in the space with the transformed metric.
Embedding the ideas of machine learning theory [25] into such optimization methods made it possible to identify the principles of organizing RSMM with space dilation [26][27][28][29]. The problem of finding the descent direction in the RSMM can be reduced to the problem of solving a system of inequalities on subgradient sets, mathematically formulated as a problem of minimizing a quality functional. This means that a new learning algorithm is embedded into the basis of some new RSMM algorithm. Thus, the convergence rate of the minimization method is determined by the properties of the learning algorithm.
Rigorous investigation of the approximation capabilities of various neural networks has received much research interest [30][31][32][33][34][35][36] and is widely applied to problems of system identification, signal processing, control, pattern recognition and many others [37]. Due to universal approximation theorem [30], a feedforward network with a single hidden layer and a sigmoidal activation function can arbitrarily well approximate any continuous function on a compact set [38]. The studies of learning theory on unbounded sets can be found in [39][40][41].
The stability of neural network solutions can be improved by introducing a regularizing term into the minimized functional [42], which stabilizes the solution using some auxiliary non-negative function carrying information on the solution obtained earlier (a priori information). The most common form of a priori information is the assumption of function smoothness in the sense that the same input signal corresponds to the same output. Commonly used regularization types include:
1. Quadratic Tikhonov regularization (or ridge regression, R_2). In the case of approximation by a linear model, the Tikhonov regularizer [42] is used:
R_2(U) = Σ_{i=1}^{k} u_i^2,
where the u_i are the parameters of the linear part of the model and k is the number of vector U components. The regularizer R_2 is mainly used to suppress large components of the vector U to prevent overfitting the model.
2. Modular linear regularization (Tibshirani Lasso, R_1). The regularizer proposed in [3] is mainly used to suppress large and small components of the vector U:
R_1(U) = Σ_{i=1}^{k} |u_i|.
3. Non-smooth regularization (R_γ) [3,43]:
R_γ(U) = Σ_{i=1}^{k} |u_i|^γ, 0 < γ < 1.
The use of R_γ leads to the suppression of small ("weak") components of the vector U. This property of R_γ enables us to reduce to zero weak components that are not essential for the description of the data.
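As an illustration, the three regularizers can be sketched in code (a minimal sketch; the vector u and the exponent gamma below are illustrative placeholders, with 0 < gamma < 1 assumed for R_γ):

```python
def r2(u):
    """Quadratic Tikhonov (ridge) regularizer: penalizes large components."""
    return sum(ui ** 2 for ui in u)

def r1(u):
    """Modular linear (Lasso) regularizer: shrinks components toward zero."""
    return sum(abs(ui) for ui in u)

def r_gamma(u, gamma=0.5):
    """Non-smooth regularizer with 0 < gamma < 1: penalizes weak components
    relatively more strongly, driving them to exact zero."""
    return sum(abs(ui) ** gamma for ui in u)

# A weight vector with one "weak" component:
u = [3.0, -0.01, 0.5]
penalties = (r2(u), r1(u), r_gamma(u))
```

Note that for a weak component such as -0.01, the penalty |u_i|^γ with γ < 1 is much larger relative to the component's size than |u_i| or u_i^2, which is exactly the suppression property described above.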
The aim of our work is to outline an approach to accelerating the convergence of learning algorithms in RSMM with space dilation and to give an example of the implementation of such an algorithm, confirming the effectiveness of theoretical constructions.
In the RSMM, successive approximations [18][19][20][26][28][44] are built as
x_{k+1} = x_k − γ_k s_{k+1}, (4)
where k is the iteration number, γ_k is the stepsize, x_0 is a given starting point, and the descent direction s_{k+1} is a solution of a system of inequalities on s ∈ R^n [20]:
(s, g) > 0 ∀ g ∈ G. (5)
Hereinafter, (s, g) is the dot product of vectors, and G is a set of subgradients calculated on the descent trajectory of the algorithm at a point x_k.
Denote by S(G) the set of solutions to inequalities (5), and by ∂f(x) the subgradient set at point x. If the function is convex, G = ∂_ε f(x_k) is the ε-subgradient set at point x_k, and s_{k+1} is an arbitrary solution of system (5), then the function will be reduced by at least ε after iteration (4) [20].
Since there is no explicit specification of ε-subgradient sets, the subgradients g_k ∈ ∂f(x_k) calculated on the descent trajectory of the minimization algorithm are used as elements of the set G. These vectors must satisfy the condition:
(s_k, g_k) ≤ 0. (6)
Inequality (6) means that for the vectors used, condition (5) is not satisfied. The choice of learning vectors is made according to this principle in the perceptron method [25,45], for instance.
The sequence of vectors g_k ∈ ∂f(x_k), k = 0, 1, . . . , is not predetermined but is determined during minimization (4), with a built-in ML algorithm finding the vector s_{k+1} at each iteration of the minimization.
Let vector s k+1 be a solution of the system of inequalities (5) for the subgradient set of some neighborhood of the current minimum x k . Then, as a result of iteration (4), we go beyond this neighborhood with a simultaneous function decrease, since vector s k+1 forms an acute angle with each of the subgradients of the set.
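In code, the acute-angle condition that s_{k+1} must satisfy over the set G can be checked directly (a minimal sketch with a hypothetical finite subgradient sample):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solves_system(s, G):
    """s solves system (5) for the set G if (s, g) > 0 for every g in G,
    i.e., s forms an acute angle with each subgradient."""
    return all(dot(s, g) > 0 for g in G)

# Toy subgradient sample of f(x) = |x1| + |x2| near the ray x1 > 0, x2 = 0:
G = [(1.0, 1.0), (1.0, -1.0), (1.0, 0.0)]
```

Here solves_system((1, 0), G) holds, while the direction (0, 1) fails on the second subgradient.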
In this work, we present a formalized model of subgradient sets, which enables us, taking into account their specificity, to formulate stronger learning relations and quality functionals, which leads to acceleration in the convergence rate of learning algorithms designed to form the direction of descent in the RSMM.
As a result of theoretical analysis, an effective learning algorithm has been developed. For the proposed ML algorithm, convergence in a finite number of iterations is proved when solving problem (5) on separable sets. Based on the learning algorithm, we proposed a method for minimizing nonsmooth functions. Its convergence on convex functions was substantiated. It is shown that on quadratic functions, the minimization method generates a sequence of minimum approximations identical to the sequence of the conjugate gradient method. We also proposed a minimization algorithm with a specific one-dimensional search method. A computational experiment confirmed its effectiveness for problems of neural network approximations, where the technology of uninformative model component suppression with regularizers similar to the Tibshirani LASSO was used [46]. The result of our work is an optimization algorithm applicable for solving neural network regularization and other machine learning problems which, in turn, contains an embedded machine learning algorithm for finding the most promising descent direction. In the problems used for comparison, the objective function forms an elongated multidimensional ravine. As with other subgradient algorithms, our new algorithm proved to be efficient for such problems. Moreover, the new algorithm outperforms known relaxation subgradient and quasi-Newton algorithms.
The rest of this article is organized as follows. In Section 2, we consider a problem of accelerating the convergence of learning algorithms used in relaxation subgradient methods. In Section 3, we make assumptions regarding the parameters of subgradient sets that affect the convergence rate of the proposed learning algorithm. In Section 4, we formulate and justify a machine learning algorithm for finding the descent direction in the subgradient method. In Section 5, we give a description of the minimization method. In Section 6, we establish the identity of the sequences generated by the conjugate gradient method and the relaxation subgradient method with space dilation on the minimization of quadratic functions. In Sections 7 and 8, we consider a one-dimensional minimization algorithm and its implementation, respectively. In Sections 9 and 10, we present experimental results for the considered algorithms. In Section 9, we show a computational experiment on complex large-sized function minimization. In Section 10, we consider experiments with training neural network models, where it is required to remove insignificant neurons. Section 11 contains a summary of the work.

Acceleration of the Learning Algorithm's Convergence in Relaxation Subgradient Methods
To use more efficient learning algorithms, a relation stronger than (5) can be written for the descent direction. We make an additional assumption about the properties of the set G.

Assumption 1. Let a convex set G ⊂ R^n belong to a hyperplane, and let its minimal length vector η also be the minimal length vector of this hyperplane.

Then, a solution of the system (s, g) = 1 ∀ g ∈ G is also a solution of (5) [26]. It can be found as a solution to a system of equations using a sequence of vectors g_k from G [26]:
(s, g_k) = 1, k = 0, 1, . . . . (7)
It is easy to see that the solution to system (7) in s is the vector s* = η/||η||^2. Figure 1 shows the subgradient set in the form of a segment lying on a straight line.

Equalities (7) can be solved by a least squares method. For example, using the quadratic quality functional E_k(s) = (1 − (s, g_k))^2/2, it is possible to implement a gradient machine learning algorithm
s_{k+1} = s_k + β_k (1 − (s_k, g_k)) g_k,
where β_k is a gradient method step. Hence, when choosing β_k = 1/(g_k, g_k), we obtain the well-known Kaczmarz algorithm [47]:
s_{k+1} = s_k + [(1 − (s_k, g_k))/(g_k, g_k)] g_k. (8)
The found direction s_{k+1} satisfies the learning equality (s_{k+1}, g_k) = 1. To ensure the possibility of decreasing the function as a result of iteration (4), the new descent direction in the minimization method must be consistent with the current subgradient, i.e., satisfy the inequality (s_{k+1}, g_k) > 0. Process (8) satisfies this condition.

To propose a faster algorithm, consider the interpretation of process (8). Let Assumption 1 be fulfilled. The step of process (8) is equivalent to a step of one-dimensional minimization of the function E_k(s) along the direction g_k. Let the current approximation s_k be obtained using a vector g_{k−1} and satisfy the condition (s_k, g_{k−1}) = 1. Figure 2 shows the projections of the current approximation s_k and the required vector s* on the plane of vectors g_{k−1}, g_k. Straight lines W_1 and Z_1 are projections of the hyperplanes for vectors s given by the equalities (s, g_{k−1}) = 1 and (s, g_k) = 1. Vector s^1_{k+1} is a projection of the approximation obtained from iteration (8).
Figure 2. Projections of approximations s_{k+1} in the plane of vectors g_k, g_{k−1}.
If (g_k, g_{k−1}) ≤ 0, then the angle between the subgradients is obtuse, and the angle ϕ is acute (Figure 2). In this case, it is possible to completely extinguish the projection of the residual between s_k and s* in the plane of vectors g_{k−1}, g_k, passing from point A to point C along vector AC, perpendicular to the vector g_{k−1}, i.e., along the vector
p_k = g_k − g_{k−1} (g_k, g_{k−1})/(g_{k−1}, g_{k−1}). (9)
In this case, the iteration has the form
s_{k+1} = s_k + [(1 − (s_k, g_k))/(p_k, g_k)] p_k. (10)
In Figure 2, this vector is denoted as s^2_{k+1}. The vector s_{k+1} obtained by formula (10) satisfies the equalities (s_{k+1}, g_{k−1}) = 1, (s_{k+1}, g_k) = 1 and coincides with the projection of the optimum s* of the function E(s). At small angles ϕ between the straight lines W_1 and Z_1, the acceleration of convergence for process (10) becomes essential. In this work, process (10) will be used to accelerate the convergence in the metric of the iterative least squares method.
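The one-vector step (8) and the two-vector step (9)-(10) can be compared numerically (a sketch consistent with the equalities stated in the text; vectors are plain Python lists, and g_prev plays the role of g_{k−1}):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kaczmarz_step(s, g):
    """Process (8): after the step, (s_new, g) = 1."""
    c = (1 - dot(s, g)) / dot(g, g)
    return [si + c * gi for si, gi in zip(s, g)]

def two_vector_step(s, g_prev, g):
    """Process (10): correct along p (9), the component of g orthogonal to
    g_prev, so that (s_new, g_prev) is preserved and (s_new, g) = 1
    (assumes (p, g) != 0)."""
    c = dot(g, g_prev) / dot(g_prev, g_prev)
    p = [gi - c * hi for gi, hi in zip(g, g_prev)]
    t = (1 - dot(s, g)) / dot(p, g)
    return [si + t * pi for si, pi in zip(s, p)]
```

Starting from an s with (s, g_prev) = 1, the two-vector step lands on the intersection of both hyperplanes, whereas the Kaczmarz step only guarantees (s, g) = 1.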
Using learning information (7), one of the possible solutions to the system of inequalities (5) can be found in the form s_{k+1} = arg min_s F_k(s), where
F_k(s) = (1/2) Σ_{i=0}^{k} w_i (1 − (s, g_i))^2 + (1/2) Σ_{i=1}^{n} s_i^2.
Such a solution can be obtained by the iterative least squares method (ILS). With weight factors w_i = 1, after the arrival of new data g_k, the transition from the previously found solution s_k to a new solution s_{k+1} in ILS is made as follows:
s_{k+1} = s_k + [(1 − (s_k, g_k))/(1 + (g_k, H_k g_k))] H_k g_k, (11)
H_{k+1} = H_k − (H_k g_k g_k^T H_k)/(1 + (g_k, H_k g_k)). (12)
Note that, in contrast to the usual ILS, there is a regularizing component Σ_{i=1}^{n} s_i^2/2 in F_k(s), which allows us to use transformations (11) and (12) from the initial iteration, setting s_0 = 0 and H_0 = I.
In [26], based on ILS (11) and (12), an iterative process is proposed for solving the system of inequalities (5) using learning information (7):
s_{k+1} = s_k + [(1 − (s_k, g_k))/(g_k, H_k g_k)] H_k g_k, (13)
H_{k+1} = H_k − (1 − α_k^{−2}) (H_k g_k g_k^T H_k)/(g_k, H_k g_k). (14)
Here, α_k > 1 is a space dilation parameter. Consider the rationale for the method of obtaining formulas (13) and (14). Using processes (11) and (12) for data scaled by a factor q > 0, which is equivalent to introducing the weight factors w_k = 1/[q(g_k, H_k g_k)] in F(s), and then returning to the original data g_k, we obtain expressions (15) and (16). The transformation of matrices (16) is practically equivalent to (14) after the appropriate choice of the parameter q. For transformation (15), the condition (s_{k+1}, g_k) > 0, providing the possibility of decreasing the function in the course of iteration (4) along the direction s_{k+1}, may not be satisfied. Therefore, transformation (13) is used with q_k = 1, which ensures the equality (s_{k+1}, g_k) = 1 > 0. Transformation (13) can be interpreted as the Kaczmarz algorithm in the corresponding metric. As a result, we obtain processes (13) and (14). Methods [18][19][20] of the class under consideration possess the properties of the conjugate gradient method. The noted properties expand the area of effective application of such methods.
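A compact sketch of one step of (13) and (14), with the dilation update written in the standard rank-one space-dilation form H − (1 − α^{−2}) H g g^T H/(g, Hg) (an assumption consistent with space dilation methods, not a verbatim quote of [26]):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, g):
    return [dot(row, g) for row in H]

def step_13(s, g, H):
    """Kaczmarz step in the metric H: afterwards (s_new, g) = 1."""
    Hg = matvec(H, g)
    c = (1 - dot(s, g)) / dot(g, Hg)
    return [si + c * v for si, v in zip(s, Hg)]

def step_14(H, alpha, g):
    """Space dilation along g with parameter alpha > 1 (assumed rank-one form)."""
    Hg = matvec(H, g)
    beta = (1 - 1.0 / alpha ** 2) / dot(g, Hg)
    n = len(H)
    return [[H[i][j] - beta * Hg[i] * Hg[j] for j in range(n)] for i in range(n)]
```

With H = I this reduces to the plain Kaczmarz step (8); the matrix update shrinks H along g, i.e., dilates the space in the subgradient direction.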
The convergence rate of processes (13) and (14) is higher the greater the admissible value of α_k in (14) [26], which depends on the characteristics of the set G. In algorithms (13) and (14), we distinguish two stages: the correction stage (13), reducing the residual between the optimal solution s* and the current approximation s_k, and the space dilation stage (14), resulting in an increase in the residual in the transformed space without exceeding its initial value, which limits the magnitude of the space dilation parameter. To create more efficient algorithms for solving systems of inequalities, we have to choose the direction of correction in such a way that the reduction of the residual is greater than that of process (13). The direction of space dilation should be chosen so that it becomes possible to increase the space dilation parameter value due to this choice.
This paper presents one of the special cases of the implementation of the correction stage and the space dilation stage. It was proposed to use a linear combination of the vectors g_{k−1}, g_k in transformation (13) instead of the vector g_k when it is appropriate:
p_k = g_k − g_{k−1} (g_k, H_k g_{k−1})/(g_{k−1}, H_k g_{k−1}), (17)
s_{k+1} = s_k + [(1 − (s_k, g_k))/(p_k, H_k g_k)] H_k p_k. (18)
Transformations (17) and (18) are similar to the previously discussed transformations (9) and (10) carried out in the transformed space.
In the matrix transformation (14), we use the vector y_k = g_k − g_{k−1} instead of the vector g_k:
H_{k+1} = H_k − (1 − α_k^{−2}) (H_k y_k y_k^T H_k)/(y_k, H_k y_k). (19)
As shown below, the discrepancy between the optimal solution s* and the current approximation s_{k+1} along the vector y_k is small, which makes it possible to use large space dilation parameters α_k in (19). Iterations (18) and (19) are conducted under the condition:
(g_k, H_k g_{k−1}) ≤ 0. (20)
In the next section, assumptions will be made regarding the parameters of subgradient sets that affect the convergence rate of the proposed learning algorithm and determine the permissible parameters of space dilation. This allows us to formulate and justify a machine learning algorithm for finding the descent direction in the subgradient method. Note that the described parameterization of the sets does not impose any restrictions on the subgradient sets but is used only for the purpose of developing constraints on the learning algorithm parameters.

Formalization of the Separable Sets Model
In this section, we will use the following notation (as earlier, vector η is the shortest vector of G):
(1) ρ = ||η|| is the length of the minimal vector of the set;
(2) R = max_{g∈G} ||g|| is the length of the maximal vector of the set;
(3) µ = η/ρ is the normalized vector η;
(4) s* = µ/ρ is a vector associated with the sought solution of systems (5) and (7) when analyzing the ML algorithm;
(5) R_s = max_{g∈G} (µ, g) is an upper-bound value of the set G in the direction µ;
(6) M = R_s/ρ is the ratio of the upper and lower bounds of the set along µ, and r = ρ/R_s = M^{−1};
(7) V = ρ/R is the ratio of the minimal and maximal vectors of the set.
For some set Q, we will use the noted characteristics indicating the set as an argument; for example, η(Q), r(Q).
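For a finite sample of subgradients, these characteristics can be estimated directly (a sketch; taking η as the minimum-norm sample point is a simplification of the minimum-norm element of the convex set):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def characteristics(G):
    """rho, R, R_s, M, V for a finite subgradient sample G."""
    eta = min(G, key=lambda g: dot(g, g))      # shortest sample vector
    rho = math.sqrt(dot(eta, eta))
    R = max(math.sqrt(dot(g, g)) for g in G)
    mu = [x / rho for x in eta]                # normalized eta
    R_s = max(dot(mu, g) for g in G)           # upper bound along mu
    return {"rho": rho, "R": R, "R_s": R_s, "M": R_s / rho, "V": rho / R}
```

For the flat set G = [(1, 1), (1, -1), (1, 0)], all projections onto µ coincide, so this yields R_s = ρ = 1 and M = 1 (zero thickness).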
We assume that the set G satisfies the following assumption.

Assumption 2. Set G is convex, closed, bounded (R < ∞) and satisfies the separability condition, i.e., ρ > 0.

Vector s* is a solution to the system of inequalities (5), and ρ and R_s describe the thickness of the set G in the direction µ:
ρ ≤ (µ, g) ≤ R_s ∀ g ∈ G. (21)
Due to (21) and the form of s*, we have (s*, g) = (µ, g)/ρ ≥ 1 for all g ∈ G. R_s, according to its definition, satisfies the constraints:
ρ ≤ R_s ≤ R. (22)
The characteristic R_s determines the thickness of the set G and significantly affects the convergence rate of learning algorithms with space dilation. When the thickness of the set is equal to zero, R_s = ρ and a flat set takes place (Figure 1). Set G and its characteristics with boundaries (22) are shown in Figure 3.
For example, consider the function
f(x) = Σ_{i=1}^{n} a_i |x_i|, a_i > 0, (23)
at the point x_0 = (b_1, . . . , b_m, 0, . . . , 0)^T with b_i ≠ 0, i = 1, . . . , m. The point x_0 is located at the bottom of a multidimensional ravine. Let us study the subgradient set at point x_0. As earlier, g(x_0) is a subgradient of the function. The components of subgradient vectors at non-zero values x_i are g_i(x) = sign(x_i) a_i. For zero x_i, the components of the subgradient vectors belong to the set g_i(x) ∈ [−a_i, a_i], which contains the zero component g_i(x) = 0. Hence, it follows that the subgradient of minimum length of function (23) at the point x_0 has the form η = g_min^0 = (sign(b_1)a_1, . . . , sign(b_m)a_m, 0, . . . , 0)^T. Maximum length subgradients are those whose components at the zero coordinates equal ±a_i, i = m + 1, . . . , n. It is easy to verify that the projections of arbitrary subgradients at the point x_0 onto the vector g_min^0 are the same. Therefore, the thickness of the subgradient set is zero. In a sufficiently small neighborhood of the point x_0, the union of the subgradient sets of function (23) coincides with the subgradient set at the point x_0. Consequently, the descent direction in the form of a solution to the system of inequalities (7) enables us to go beyond this neighborhood. Figure 4 shows a representation of the subgradient set of a two-dimensional piecewise linear function (24), which is linear in each quadrant; its subgradients at the point x_0 are shown in the figure, and the minimum length vector is η = (−1, 0)^T.
For large values of the parameter a in (24), the complexity of solving the minimization problem increases significantly.
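The componentwise structure of the subdifferential described above can be sketched for a function of the assumed form f(x) = Σ_i a_i|x_i| (a hypothetical instance of example (23)):

```python
def min_norm_subgradient(a, x):
    """Minimum-length subgradient of f(x) = sum_i a_i * |x_i|:
    sign(x_i) * a_i where x_i != 0, and 0 (the midpoint choice from
    the interval [-a_i, a_i]) where x_i = 0."""
    sign = lambda t: (t > 0) - (t < 0)
    return [sign(xi) * ai for ai, xi in zip(a, x)]

def is_subgradient(g, a, x):
    """Check g component by component against the subdifferential."""
    sign = lambda t: (t > 0) - (t < 0)
    return all(
        gi == sign(xi) * ai if xi != 0 else -ai <= gi <= ai
        for gi, ai, xi in zip(g, a, x)
    )
```

At a point with zero coordinates, the free components g_i ∈ [−a_i, a_i] do not change the projection onto the minimum-length subgradient, which is why the set's thickness in that direction is zero.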

Machine Learning Algorithm with Space Dilation for Solving Systems of Inequalities on Separable Sets
In this section, we briefly introduce an algorithm for solving systems of inequalities from [1] and theoretically justify iterations (18) and (19). Specific operators will be used for transformations (13), (14) and (17)-(19). Denote by S(s, g, p, H) the operator of transformation (18), where the correspondence of arguments to the eponymous components is used. Then, for example, formula (13) can be represented as s_{k+1} = S(s_k, g_k, g_k, H_k). Similarly, for (14) and (19), we introduce the operator H(H, α, g); formula (19) can then be represented as H_{k+1} = H(H_k, α_k, y_k). For the chain of approximations s_k, we form the residual vector ∆_k = s* − s_k. As long as the vector s_k is not a solution to (5), for the vectors g_k selected at step 2 of Algorithm 1, from (6) and (22), the following inequality holds:
(∆_k, g_k) = (s*, g_k) − (s_k, g_k) ≥ 1. (27)
The transformation equation for the matrix A_k = H_k^{−1} is as follows [1]:
A_{k+1} = A_k + (α_k^2 − 1) (g_k g_k^T)/(g_k, H_k g_k).
For the vectors s_k and g_k of Algorithm 1:
1 ≤ (∆_k, g_k) = (A_k^{1/2} ∆_k, H_k^{1/2} g_k) ≤ [(∆_k, A_k ∆_k)(g_k, H_k g_k)]^{1/2}, (28)
where A^{1/2} A^{1/2} = A, and A > 0 is a symmetric, strictly positive definite matrix.

Algorithm 1
Method for solving systems of inequalities.
1: Set k = 0, s_0 = 0, g_{−1} = 0, H_0 = I. Set α > 1 as the limit for choosing the admissible value of the parameter α_k for transformations (13) and (14).
2: Find g_k ∈ G satisfying condition (6): (s_k, g_k) ≤ 0.
3: If such a vector does not exist, then the solution s_k ∈ S(G) is found; stop the algorithm.
4: If k = 0 or condition (20) does not hold, go to step 7.
5: Perform transformation (18) s_{k+1} = S(s_k, g_k, p_k, H_k) and compute the limit α_yk of the admissible values of the space dilation parameter for the combination of transformations (18) and (19).
6: If α_yk^2 ≥ α^2, then set α_k satisfying the inequalities α^2 ≤ α_k^2 ≤ α_yk^2 and perform transformation (19) H_{k+1} = H(H_k, α_k, y_k); otherwise, compute the limit α_gk of the admissible values of the space dilation parameter for the combination of transformations (18) and (14), set α_k satisfying the inequalities α^2 ≤ α_k^2 ≤ α_gk^2, and perform transformation (14) H_{k+1} = H(H_k, α_k, g_k). Go to step 8.
7: Set α_k^2 = α^2 and perform transformations (13) and (14): s_{k+1} = S(s_k, g_k, g_k, H_k), H_{k+1} = H(H_k, α_k, g_k).
8: Increase k by one and go to step 2.

Inequality (28) is essential in justifying the convergence rate of the methods we study for solving systems of inequalities (5). The main idea of the algorithm's construction is that the values of (∆_k, A_k ∆_k) do not increase while the values of (g_k, H_k g_k) decrease at the speed of a geometric progression. In this case, after a finite number of iterations, the right side of (28) becomes less than one. The resulting contradiction means that problem (5) is solved, and there is no further possibility of finding a vector g_k satisfying condition (6).
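The default branch of Algorithm 1 (steps 2, 3 and 7) can be sketched for a finite set G as follows (a simplified illustration that omits the two-vector branch of steps 4-6 and assumes the rank-one form of the dilation update (14)):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, g):
    return [dot(row, g) for row in H]

def solve_inequalities(G, alpha=2.0, max_iter=100):
    """Find s with (s, g) > 0 for all g in G via steps (13) and (14)."""
    n = len(G[0])
    s = [0.0] * n
    H = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        # Step 2: pick a subgradient violating (5), i.e., condition (6).
        g = next((g for g in G if dot(s, g) <= 0), None)
        if g is None:
            return s  # Step 3: s solves system (5).
        Hg = matvec(H, g)
        gHg = dot(g, Hg)
        # Step 7: correction (13), then dilation (14).
        c = (1 - dot(s, g)) / gHg
        s = [si + c * v for si, v in zip(s, Hg)]
        beta = (1 - 1.0 / alpha ** 2) / gHg
        H = [[H[i][j] - beta * Hg[i] * Hg[j] for j in range(n)]
             for i in range(n)]
    return s
```

On the flat toy set used earlier, this loop terminates after a couple of corrections with a direction forming an acute angle with every sample subgradient.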
For the decreasing rate of the sequence {τ_k}, τ_k = min_{0≤j≤k−1} [(g_j, H_j g_j)/(g_j, g_j)], the following theorem is known [26].

Theorem 1 ([26]). Let the matrices H_k result from transformations (14) with H_0 = I, α_k = α > 1 and arbitrary g_k ∈ R^n, g_k ≠ 0, k = 0, 1, 2, . . . . Then estimate (29) holds.

This theorem does not impose restrictions on the choice of vectors g_k. Therefore, regardless of which of equations (14) or (19) is used to transform the matrices, the result (29) is valid for a sequence of vectors composed of g_k or y_k, depending on which one of them is used to transform the matrices. Let us show that for Algorithm 1 with fixed values of the parameter α_k, estimates similar to (29) are valid, and let us obtain expressions for the admissible parameters α_k in (14) and (19) at which the values (∆_k, A_k ∆_k) do not increase.
In order to visualize the operation of the algorithm, similarly to the analysis of iterations of processes (9) and (10) carried out on the basis of Figure 2, we pass to the coordinate system ŝ = A_k^{1/2} s. In this new coordinate system, the corresponding vectors and matrices of iterations (13), (14) and (18), (19) are transformed as follows [1]:
ŝ_k = A_k^{1/2} s_k, g̃_k = H_k^{1/2} g_k.
Analogous transformed relations hold for expressions (17), (18) and (27), and inequality (22) for the new variables takes the form (37). In Figure 5, the characteristics of the set Ĝ in the plane Z formed by the vectors g̃_k, g̃_{k−1} are shown. Straight lines W_1, W_M are projections of hyperplanes, i.e., the corresponding boundaries of inequality (37) for the possible positions of the projections of the vector ŝ* defined by the normal g̃_{k−1}. Straight lines Z_1, Z_M are boundaries of inequality (37) for the possible projection positions of the vector ŝ* defined by the normal g̃_k.
Let ψ be the angle between the vectors g̃_k and g̃_{k−1}. Since in Figure 5 this angle is obtuse, condition (20) holds. Consequently, the angle ϕ in Figure 5 is acute. Since the vectors g̃_k, g̃_{k−1} are normals to the straight lines Z_1 and W_1, we obtain relations (38) and (39) [1].

Figure 5. Characteristics of the set Ĝ in the plane of vectors g̃_k, g̃_{k−1}.
The following lemmas [1] allow us to estimate the admissible values of space dilation parameters.

Lemma 2. Let vectors p_1, p_2 and g be linked by the equalities (p_1, g) = a, (p_2, g) = b. Let the difference of vectors p_2 − p_1 be collinear to the vector p, and let ξ be the angle between the vectors p and g; then:
||p_2 − p_1|| = |b − a|/(||g|| |cos ξ|). (41)

Lemma 3. As a result of transformation (13) at step 7 of Algorithm 1, the following equality holds:
(s_{k+1}, g_k) = 1, (42)
and as a result of (18) at step 5, (42) holds together with the equality
(s_{k+1}, g_{k−1}) = 1. (43)

Lemma 4. Let the set G satisfy Assumption 2. Then, the limit α of the admissible parameter value α_k for transformations (13) and (14) is given by equality (44).

Lemma 5. Let the set G satisfy Assumption 2. Then, the limit α_gk of the admissible parameter value α_k at step 5 of Algorithm 1 under transformation (14) is given by equations (45) and (46).

Lemma 6. Let the set G satisfy Assumption 2. Then, the limit α_yk for the admissible value of the parameter α_k at step 5 of Algorithm 1 under transformations (18) and (19) is given by equations (47)-(49). The value cos^2 ϕ is defined in (39), and sin^2 ϕ = 1 − cos^2 ϕ.
In matrix transformations (14) and (19) of Algorithm 1, vectors g k and y k are used, which does not allow for directly using estimate (29) of Theorem 1 in the case when expression τ k involves some vector y m , m < k. In the next theorem, an estimate similar to (29) is obtained directly for subgradients g k generated by Algorithm 1.

Theorem 2.
Let the set G satisfy Assumption 1, and let the sequence {π_k}, π_k = min_{0≤j≤k−1} (g_j, H_j g_j) = (g_{J_k}, H_{J_k} g_{J_k}), be calculated based on the characteristics of Algorithm 1 for fixed values of the space dilation parameters α_k^2 = α^2 specified at steps 5 and 6, where the parameter α is specified according to (44). Then estimate (50) holds, where m = arg min_{0≤j≤k−1} (g_j, H_j g_j).
Theorem 3. Let the set G satisfy Assumption 1, and let the sequence {(∆_k, A_k ∆_k)} be calculated based on the characteristics of Algorithm 1. Let the dilation parameter α satisfy constraint (44), and let the admissible value α_yk^2 be given by (47). Then inequality (51) holds.

For fixed values of the space dilation parameter, the following theorem on the convergence of Algorithm 1 holds.

Theorem 4. Let the set G satisfy Assumption 2. Let the values of the space dilation parameters in Algorithm 1 specified at steps 5 and 6 be fixed as α_k^2 = α^2, and let the parameter α be given according to constraint (44). Then, the solution to system (5) will be found by Algorithm 1 in a finite number of iterations not exceeding K_0, the minimum integer k satisfying inequality (52). Herewith, until a solution is found (s_k ∉ S(G)), inequalities (53) hold.

According to (21), the parameters ρ and R_s characterize the spread of the vectors g ∈ G along the vector µ. If ρ = R_s, the set G lies in a hyperplane with normal µ. Such a structure of the set G allows one to specify large values of the parameter α (44) in Algorithm 1 and, according to (52), to obtain a solution in a small number of iterations.
In the minimization algorithm (4), on the descent trajectory, due to the exact one-dimensional search, it is always possible to choose a subgradient from the subgradient set satisfying condition (6), including at the minimum point. Therefore, we will impose constraints on the subgradient sets of the functions to be minimized, similar to those for the set G. Due to possible biases in the minimization algorithm, we need to define the constraints taking into account the union of subgradient sets in the neighborhood of the current minimum point x_k, and use these characteristics, based on the results of Theorem 4, to develop a stopping criterion for the algorithm for solving systems of inequalities.

Minimization Algorithm
Since an exact one-dimensional search is performed, in the subgradient set at point x_{k+1} there is always a subgradient satisfying condition (6): (s_{k+1}, g_{k+1}) ≤ 0. For example, for smooth functions, the equality (s_{k+1}, g_{k+1}) = 0 holds. Therefore, the vector g_{k+1} can be used in Algorithm 1 to find a new approximation of the descent vector. In the built-in algorithm for solving systems of inequalities, the dilation parameter is chosen to solve the system of inequalities for the union of subgradient sets in some neighborhood of the current approximation x_k. This allows the minimization algorithm to leave the neighborhood after a finite number of iterations.
Due to possible significant biases during the operation of the minimization algorithm, the hull of the subgradient set involved in the learning may contain a zero vector. To avoid situations when the system of inequalities (5) has no solution for the subgradient set from the operational area of the algorithm, we introduce updates into the algorithm for solving systems of inequalities. To track the updates, we use a stopping criterion formulated based on the results of Theorem 4.
To accurately determine the parameters of the algorithm involved in the calculation of the dilation parameters α_yk and α_gk, we define their calculation in the form of operators. Denote by AL_g^2(M, H, g_{−1}, g) the operator of calculation of α_gk^2 according to (45) and (46) in Lemma 5, that is, α_gk^2 = AL_g^2(M, H_k, g_{k−1}, g_k). For the calculation of α_yk^2 according to expressions (47)-(49) in Lemma 6, we introduce the operator AL_y^2(M, H, g_{−1}, g), where the parameters H, g_{−1}, g correspond to H_k, g_{k−1}, g_k.
A description of the minimization method is given in Algorithm 2.
Set σ > 0, the parameters M > 0, r = 1/M, and the limit α of the dilation parameter according to equality (44). Compute the limit of the admissible value of the dilation parameter α²_yk = AL²_yg(M, H_k, g_{k−1}, g_k) for the combination of transformations (18) and (19), and the limit of the admissible value of the dilation parameter α²_gk for the combination of transformations (18) and (14); set α_k satisfying the inequalities α² ≤ α²_k ≤ α²_gk, perform transformation (14) H_{k+1} = H(H_k, α_k, g_k), and go to step 8.
end if
7: Set α²_k = α² and perform transformations (13), (14): s_{k+1} = S(s_k, g_k, g_k, H_k), H_{k+1} = H(H_k, α_k, g_k).
8: Find a new approximation of the minimum point: x_{k+1} = x_k − γ_k s_{k+1}, γ_k = arg min_γ f(x_k − γ s_{k+1}).
9: Compute a subgradient g_{k+1} ∈ ∂f(x_{k+1}) satisfying the condition (g_{k+1}, s_{k+1}) ≤ 0.
10: Increase k and l by one and go to step 2.

At step 9, due to the exact one-dimensional descent at step 8, the sought subgradient always exists; this follows from the extremum condition for the one-dimensional function. For the sequence of approximations of the algorithm, due to the exact one-dimensional descent at step 8, the following lemma holds [20].
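The overall iteration above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the descent-vector update S(s_k, g_k, g_k, H_k) and the exact transformations (13), (14), (18), (19) are replaced by simplified stand-ins (a classical Shor-style dilation of the metric along the current subgradient), and the exact one-dimensional search is approximated by dense sampling.

```python
import numpy as np

def dilate_metric(H, g, alpha):
    """Space-dilation update of the metric matrix H along subgradient g.
    This is the classical Shor-style form; the paper's transformation (14)
    is assumed to be of this general shape (an assumption of this sketch)."""
    Hg = H @ g
    return H + (1.0 / alpha**2 - 1.0) * np.outer(Hg, Hg) / (g @ Hg)

def rsm_step(f, subgrad, x, H, alpha):
    """One hypothetical iteration: build a descent vector from the current
    subgradient and metric, dilate the metric, then perform an (approximately
    exact) one-dimensional search and return the new point."""
    g = subgrad(x)
    s = H @ g / (g @ H @ g)        # simplified stand-in for S(s_k, g_k, g_k, H_k)
    H_new = dilate_metric(H, g, alpha)
    # "exact" one-dimensional search by dense sampling (illustration only)
    gammas = np.linspace(0.0, 10.0, 20001)
    vals = [f(x - t * s) for t in gammas]
    gamma = gammas[int(np.argmin(vals))]
    return x - gamma * s, s, H_new
```

On a smooth convex function, a single such step with exact line search already produces strict descent.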

Lemma 7.
Let function f(x) be strictly convex on R^n, let the set D(x_0) be bounded, and let x* be the minimum point of the function; consider the limit points of the sequence {w_q} generated by Algorithm 2 (RA(α)). The existence of limit points of the sequence {w_q} when the set D(x_0) is bounded follows from w_q ∈ D(x_0). Concerning the convergence of the algorithm, we formulate the following theorem [1].
Theorem 5. Let function f(x) be strictly convex on R^n, let the set D(x_0) be bounded, let the parameters M, r, and α of Algorithm 2 be given according to the corresponding equalities, and let the parameters α_k set at steps 5 and 7 be fixed as α_k = α. In this case, if σ = (3V_0/4)², then any limit point of the sequence {w_q} generated by Algorithm 2 (RA(α)) is a minimum point on R^n.

Relationship between the Relaxation Subgradient Method and the Conjugate Gradient Method
Let us establish the identity of the sequences generated by the conjugate gradient method and by the relaxation subgradient method with space dilation when minimizing quadratic functions. Suppose that the number of iterations without updating is the same for the relaxation subgradient method and the conjugate gradient method and equal to m ≤ n. Denote by ∇f(x) the gradient of the function at point x, by g_k = ∇f(x_k) the gradients at the points of the current minimum approximation x_k, and by x_0 ∈ R^n the initial point. For the purposes of comparison with the conjugate gradient method, the iterations of the subgradient method are briefly represented in Algorithm 3 (SSG). As we will see below, the metric transformations are not essential for establishing the identity of the approximation sequences of the subgradient method and the conjugate gradient method on quadratic functions; therefore, this scheme omits the details of the choice of the space dilation transformations and their parameters.

2.2: Perform the metric transformation according to one of the formulas (14) or (19), as in Algorithm 2 (RA(α)), where z_k = g_k in the case of transformation (14) and z_k = g_k − g_{k−1} in the case of transformation (19).
2.3: Execute the descent step.

The iterations of the conjugate gradient method are carried out according to the scheme of Algorithm 4 (see, for instance, [17]).

The following theorem gives conditions for the equivalence of the methods when minimizing quadratic functions.

Theorem 6. Let function f(x) be quadratic, and for the SSG process let the matrix H_0 = I and α_k ≥ 1. Then, if the initial points of the SSG and SG processes coincide, x_0 = x̄_0, then for any m > k ≥ 0, until a solution is found, the corresponding sequences coincide; i.e., both processes generate coincident successive approximations.
Since x_l = x̄_l and the vectors s_l and s̄_l are collinear, then by virtue of the exact one-dimensional descent, x̄_{l+1} = x_{l+1}. For the conjugate gradient method, the gradient vectors calculated during the operation of the algorithm are mutually orthogonal [17]. Therefore, by virtue of (62), all of the vectors g_0, g_1, . . . , g_l are mutually orthogonal. Using the recursive transformation of inverse matrices for (60), we obtain an expression for the matrix A_l. Since in (60) the vectors z_k = g_k or z_k = g_k − g_{k−1}, all vectors z_k, k = 0, 1, . . . , l − 1, participating in the formation of the matrix A_l are orthogonal to the vector g_l. Therefore, A_l g_l = g_l, which implies the equality H_l g_l = g_l. Due to the orthogonality of the vectors g_l and g_{l−1}, according to (59), the equality p_l = g_l holds. By virtue of the exact one-dimensional descent, (s_l, g_l) = 0. In view of the above, from (58) we obtain

s_{l+1} = s_l + H_l g_l / (g_l, H_l g_l) = s̄_l / (g_{l−1}, g_{l−1}) + g_l / (g_l, g_l).
Multiplying the last equality by (g_l, g_l), we obtain a proof of equality (63) for k = l + 1:

s_{l+1} (g_l, g_l) = s̄_l (g_l, g_l) / (g_{l−1}, g_{l−1}) + g_l = s̄_{l+1}.
Note. For an arbitrary initial matrix H_0 > 0, one should pass to the coordinate system in which the initial matrix H̄_0 = I, and in the new coordinate system, one should use the results of Theorem 6.
Under an exact one-dimensional search, the presented RSMM with space dilation has the properties of the conjugate gradient method: on a quadratic function, it is equivalent to the conjugate gradient method.
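The argument of Theorem 6 rests on the mutual orthogonality of the gradients produced by the conjugate gradient method with exact line search on a quadratic function. This property is easy to verify numerically; the sketch below assumes the quadratic f(x) = ½xᵀAx − bᵀx with a symmetric positive definite A.

```python
import numpy as np

def cg_gradients(A, b, x0, m):
    """Run m exact-line-search conjugate gradient steps on the quadratic
    f(x) = 0.5*x^T A x - b^T x and collect the gradients g_k = A x_k - b,
    which the proof of Theorem 6 requires to be mutually orthogonal."""
    x = x0.copy()
    g = A @ x - b
    p = -g
    grads = [g.copy()]
    for _ in range(m):
        Ap = A @ p
        t = (g @ g) / (p @ Ap)            # exact one-dimensional step
        x = x + t * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves coefficient
        p = -g_new + beta * p
        g = g_new
        grads.append(g.copy())
    return grads
```

In exact arithmetic, all pairwise inner products (g_i, g_j), i ≠ j, vanish; in floating point, they are zero to rounding error.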

One-Dimensional Search Algorithm
Consider a one-dimensional minimization algorithm for process (4). Computational experience shows that the one-dimensional search algorithm in relaxation subgradient methods should have the following properties:

1. An overly accurate one-dimensional search in subgradient methods leads to poor convergence; the search should be adequately rough.

2. At each iteration of the method, the search step should be large enough to leave a sufficiently large neighborhood of the current minimum.

3. To ensure the property of the previous item, it should be possible to increase the search step faster than to decrease it.

4. In RSMMs, the one-dimensional search should provide over-relaxation, that is, overshoot the point of the one-dimensional minimum along the direction of descent when implementing iteration (4). This provides condition (6), which is necessary for the learning process.

5. When implementing one-dimensional descent along the direction at iteration (4), the point with the smallest function value can be taken as the new approximation of the minimum, while the points ensuring condition (6) can be used for learning.
We use the implementation of the one-dimensional minimization procedure proposed in [26]. The set of input parameters is {x, s, g_x, f_x, h_0}, where x is the current minimum approximation point, s is the descent direction, h_0 is the initial search step, f_x = f(x), and g_x ∈ ∂f(x); moreover, the necessary condition for the possibility of decreasing the function along the direction, (g_x, s) > 0, must be satisfied. Its output parameters are {γ_m, f_m, g_m, γ_1, g_1, h_1}. Here, γ_m is the step to the new minimum approximation point, and γ_1 is a step along s such that, at the point z^+ = x − γ_1 s, the subgradient g_1 ∈ ∂f(z^+) satisfies the inequality (g_1, s) ≤ 0. This subgradient is used in the learning algorithm. The output parameter h_1 is the initial descent step for the next iteration.
Step h 1 is adjusted in order to reduce the number of calls to the procedure for calculating the function and the subgradient.
In the minimization algorithm, the vector g_1 ∈ ∂f(z^+) is used to solve the system of inequalities, and the point x^+ = x − γ_m s is the new minimum approximation point.
Denote the call to the one-dimensional minimization procedure by OM({x, s, g_x, f_x, h_0}; {γ_m, f_m, g_m, γ_1, g_1, h_1}). Here is a brief description of it. We introduce the one-dimensional function ϕ(β) = f(x − βs). To localize its minimum, we take the increasing sequence β_0 = 0, β_i = h_0 q_M^{i−1} for i ≥ 1. Here, q_M > 1 is the step increase parameter; in most cases, q_M = 3 is set. Denote z_i = x − β_i s, r_i ∈ ∂f(z_i), i = 0, 1, 2, . . ., and let l be the first index i at which the relation (r_i, s) ≤ 0 holds. We determine the parameters of the localization segment [γ_0, γ_1] of the one-dimensional minimum as γ_0 = β_{l−1}, f_0 = f(z_{l−1}), g_0 = r_{l−1}, γ_1 = β_l, f_1 = f(z_l), g_1 = r_l and find the minimum point γ* using a cubic approximation of the function [48] on the localization segment, using the values of the one-dimensional function and its derivative. If l = 1 and γ* ≤ 0.1 γ_1, the initial descent step for the next iteration is decreased by the rule h_1 = q_m h_0 (γ_1/h_0)^{1/2}. Here, q_m < 1 is the parameter of the descent step decrease, which in most cases is set to q_m = 0.8. In the overwhelming majority of applied problems, the parameter set {q_M = 3, q_m = 0.8} is satisfactory. When solving complex problems with a high degree of level surface elongation, the parameter q_m should be increased toward 1.
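A simplified sketch of such a rough one-dimensional search is given below, under the assumption that the directional derivative ϕ'(β) is available. The cubic approximation of [48] is replaced by plain bisection on the localization segment, so this illustrates only the bracketing and step-update logic, not the paper's exact procedure.

```python
import numpy as np

def om_search(phi, dphi, h0, q_M=3.0, q_m=0.8):
    """Sketch of the rough 1-D search for phi(beta) = f(x - beta*s).
    The step is expanded by q_M until the derivative changes sign
    (over-relaxation past the 1-D minimum); the cubic fit of the paper
    is replaced here by plain bisection, which is an assumption."""
    beta_prev, beta = 0.0, h0
    while dphi(beta) < 0.0:            # minimum not yet bracketed
        beta_prev, beta = beta, beta * q_M
    lo, hi = beta_prev, beta           # localization segment [gamma_0, gamma_1]
    for _ in range(40):                # crude refinement instead of cubic fit
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    gamma_m = 0.5 * (lo + hi)
    gamma1 = beta                      # step with nonpositive derivative
    h1 = q_m * h0 * np.sqrt(gamma1 / h0)   # next initial step, as in the text
    return gamma_m, phi(gamma_m), gamma1, h1
```

Because the expansion factor q_M exceeds the contraction factor 1/q_m, the step can grow faster than it shrinks, in line with property 3 above.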

Implementation of the Minimization Algorithm
Algorithm 2 (RA(α)), as a result of the updates at step 3, loses information about the space metric. In the proposed algorithm, the matrix update is replaced by a correction of the diagonal elements, and the exact one-dimensional descent is replaced by an approximate one. Denote by Sp(H) the trace of the matrix H and by ε_H the limit of admissible decrease of the trace. The algorithm sets the initial metric matrix H_0 = I. Since the elements of the matrix decrease as a result of the transformations, when the trace of the matrix becomes small, Sp(H) ≤ ε_H, the matrix is corrected using the transformation H^+ = nH/Sp(H), where n is the space dimension. As an indicator of matrix degeneracy, we use the cosine of the angle between the vectors g and Hg. When it decreases to a certain value ε_λ, which can be detected by checking (g, Hg) ≤ ε_λ ||g|| ||Hg||, the transformation H^+ = H + 10 ε_λ I is performed, where I is the identity matrix and ε_λ is the cosine limit. To describe the algorithm, we use the previously introduced operators.
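The two corrections just described can be sketched directly; the thresholds ε_H and ε_λ below are illustrative values, not the paper's settings.

```python
import numpy as np

def correct_metric(H, g, eps_H=1e-8, eps_lam=1e-10):
    """Corrections to the metric matrix described in the text:
    rescale H when its trace becomes too small, and regularize it when
    the angle between g and Hg degenerates. Thresholds are illustrative."""
    n = H.shape[0]
    if np.trace(H) <= eps_H:                       # Sp(H) <= eps_H
        H = n * H / np.trace(H)                    # H+ = n*H / Sp(H)
    Hg = H @ g
    if g @ Hg <= eps_lam * np.linalg.norm(g) * np.linalg.norm(Hg):
        H = H + 10.0 * eps_lam * np.eye(n)         # H+ = H + 10*eps_lam*I
    return H
```

After the trace rescaling, Sp(H^+) = n, matching the initial matrix H_0 = I.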
Let us explain the actions of the algorithm. Since s_0 = 0, at k = 0 condition (6) is satisfied, (s_k, g_k) ≤ 0, and g^O_0 = g_0. Therefore, at step 8, the learning iterations (13) and (14) are implemented. According to the OM procedure, the subgradient g^O_{k+1} obtained at step 10 of the algorithm satisfies condition (6): (g^O_{k+1}, s_{k+1}) ≤ 0. Therefore, at the next iteration, it is used for learning in steps 6–8.
At step 9, an additional correction of the descent direction is made in order to ensure the condition (g_k, s_{k+1}) > 0, which is necessary for the possibility of descent in the direction opposite to s_{k+1}. From the point of view of solving the system of inequalities, this correction also improves the descent vector, which can be shown using Figure 5. Here, as under the conditions of Lemma 4, the movement is made in the direction AB, not from point A but from some point of the segment AB, where (s^{1/2}_{k+1}, g_k) < 1. Since the projection of the optimal solution lies in the area between the straight lines Z_1 and Z_M, the shift to point B, where (s_{k+1}, g_k) = 1, reduces the distance to the optimal vector.
An effective set of parameters for the OM in the minimization algorithm is {q m = 0.8, q M = 3}. The next section presents the results of numerical studies of the presented Algorithm 5.
When testing the methods, the values of the function and the subgradient were computed simultaneously. The parameter ε for quadratic functions was chosen as a sufficiently small value (10^{−10}); for nonsmooth functions, it was chosen so that the accuracies in terms of the variables are approximately the same for the different types of functions. Tables 1–6 show the number of calculations of the function and subgradient values necessary for achieving the required accuracy f(x_k) − f* ≤ ε. The initial minimization point x_0 and the value ε are given in the description of each function.
The test set contains quadratic and piecewise linear functions. Due to their simplicity and unambiguity, an analysis of the level surface elongation can be carried out easily. This choice of functions reflects the fact that, during minimization, the local representation of a function in the current minimization area, as a rule, is either quadratic or piecewise linear.
Functions 1 and 2 are quadratic, with the ratio of the minimum to the maximum eigenvalue equal to 1/n^6; the ratio of the minimum to the maximum extent of the level surface along the coordinate axes is 1/n^3. Function 2, in comparison with function 1, has a higher density of eigenvalues in the region of small values. Function 3 is smooth with a low degree of variation of the level surface elongation; its complexity is due to its degree being above quadratic. Functions 4 and 5 are piecewise linear. For these functions, the ratio of the minimum to the maximum extent of the level surface along the coordinate axes is 1/n^3, the same as for quadratic functions 1 and 2. It is of interest to compare the complexity of minimizing smooth and nonsmooth functions by nonsmooth optimization methods, provided that their level surface elongation ratios are identical.
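The exact test functions are not reproduced in this extract; the following are hypothetical instances with the stated ratios (eigenvalues i^6 give a min/max eigenvalue ratio of 1/n^6 and a level-surface axis-extent ratio of 1/n^3; coefficients i^3 in the piecewise-linear analogue give the same 1/n^3 axis-extent ratio).

```python
import numpy as np

def f_quad(x):
    """Quadratic with eigenvalues 1, 2**6, ..., n**6: min/max eigenvalue
    ratio 1/n**6, axis extents of level surfaces in ratio 1/n**3."""
    i = np.arange(1, x.size + 1, dtype=float)
    return float(np.sum(i**6 * x**2))

def g_quad(x):
    """Gradient of f_quad."""
    i = np.arange(1, x.size + 1, dtype=float)
    return 2.0 * i**6 * x

def f_pw(x):
    """Piecewise-linear analogue: axis extents of level surfaces
    are also in ratio 1/n**3."""
    i = np.arange(1, x.size + 1, dtype=float)
    return float(np.sum(i**3 * np.abs(x)))

def g_pw(x):
    """One valid subgradient of f_pw (at x_i = 0 any value in
    [-i**3, i**3] would do for that coordinate)."""
    i = np.arange(1, x.size + 1, dtype=float)
    return i**3 * np.sign(x)
```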
None of problems 1, 2, 4 and 5 can be solved by the multistep minimization method [24] for n ≥ 100, which emphasizes the relevance of methods with a change in the space metric, in particular, space dilation minimization algorithms capable of solving nonsmooth minimization problems with a high degree of level surface elongation.
In order to identify the least costly one-dimensional search for the quasi-Newtonian method, it was implemented in several variants with a unit initial step. Due to the high condition number of functions 1 and 2, for the best of these variants, the cost of localizing the one-dimensional minimum when minimizing function 1 is about 2–4 steps; together with the final approximation iteration and finding the minimum on the localized segment, this adds up to 3–5 calculations of the function and the gradient per iteration. For function 2, the total per-iteration cost is 5–10 calculations of the function and the gradient. Tables 1–3 for the QN method show only the number of iterations required to solve the problem. According to the results of Tables 1 and 2, the RA(α = const) algorithm outperforms the RSD and r_OM(α) methods on smooth functions. Therefore, the changes in the correction directions and space dilation have a positive effect on the convergence rate of the new algorithm. In the RA(α_k) algorithm, compared to RA(α = const), an additional factor of convergence acceleration is involved due to the increase in the space dilation parameter, which, according to the results of Tables 1 and 2, leads to a convergence rate higher than that of RA(α = const).
For function 2, the eigenvalues of the Hessian are shifted toward the small-values area, which has a positive effect on the convergence rate of subgradient methods. Here, the quasi-Newtonian method QN, taking into account the costs of localizing the minimum in the one-dimensional search, required a larger number of calculations of the function and the gradient. For function 3, the number of iterations of the quasi-Newtonian method turned out to be higher than the number of calculations of the function and the gradient of the subgradient methods. Based on the results of minimizing functions 1–3, we can conclude that subgradient methods with space dilation can also be useful in minimizing smooth functions; the new methods RA(α_k) and RA(α = const) show better results here than the other algorithms with space dilation. According to the ratio of the level surface extents along the coordinate axes, functions 1, 2, and 4 are similar. Function 4 is difficult to minimize by subgradient methods. Comparing the results of Tables 1 and 4, we can note insignificant differences in the convergence rate of the subgradient methods on these functions, which is additional evidence of the methods' effectiveness in solving nonsmooth optimization problems.
To simulate a subgradient set of nonzero thickness when minimizing function 4, the subgradients g(x) ∈ ∂f(x) were generated during minimization with interference according to the rule g(x) ∈ (1 + ξ)∂f(x), where ξ ∈ [0, 1] is a uniformly distributed random number. The interference negatively affects both the quality of the one-dimensional search and the quality of the descent direction. The results are shown in Table 5. The ratios of the level surface extents along the coordinate axes for functions 5 and 1 are similar. The results for function 5 are shown in Table 6. Here, the RSD method has an advantage due to the fact that the function is separable and all of its subgradients calculated in the minimization procedure are directed along the coordinate axes. Space dilations occur along the coordinate axes, which does not change the eigenvectors of the metric matrix directed along the coordinate axes. To simulate a subgradient set of nonzero thickness when minimizing function 5, the subgradients g(x) ∈ ∂f(x) were likewise generated with interference according to g(x) ∈ (1 + ξ)∂f(x), where ξ ∈ [0, 1] is a uniformly distributed random number. The results for function 5 with subgradient distortion are shown in Table 7.
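The interference model g(x) ∈ (1 + ξ)∂f(x) can be implemented as a simple wrapper around the exact subgradient oracle:

```python
import numpy as np

def noisy_subgrad(g, rng):
    """Interference model from the text: return (1 + xi) * g with xi
    uniform on [0, 1], so the returned vector lies in a 'thickened'
    subgradient set with R_s / rho = 2."""
    return (1.0 + rng.uniform(0.0, 1.0)) * g
```

The scale factor stays in [1, 2], which matches the maximum characteristic M = R_s/ρ = 2 discussed below.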
Here, the maximum possible value of the characteristic of the subgradient set is M = R_s/ρ = 2. The proposed methods show better results here than the RSD and r_OM(α) methods. A number of conclusions can be drawn regarding the convergence rate of the presented methods:

1. Functions 1, 2, 4, and 5 have a significant degree of level surface elongation. The problems of minimizing these functions could not be solved by the multistep minimization methods investigated in [24], which emphasizes the relevance of developing methods with a change of the space metric, in particular, space dilation minimization algorithms capable of solving nonsmooth minimization problems with a high degree of level surface elongation.

2. Based on the results of minimizing the smooth functions 1–3, we can conclude that subgradient methods with space dilation can also be useful in minimizing smooth functions. At the same time, the new algorithms RA(α_k) and RA(α = const) show significantly better results on smooth functions than the other subgradient methods, RSD and r_OM(α).

3. The new methods RA(α_k) and RA(α = const) significantly outperform the RSD and r_OM(α) methods when minimizing nonsmooth functions. In the RA(α_k) algorithm, in comparison with the RA(α = const) algorithm, an additional factor of convergence acceleration is involved due to the increase in the space dilation parameter, which leads to a significant further increase in the convergence rate.

Computational Experiment Results in Approximation by Neural Networks
The purpose of this section is to demonstrate the usefulness of applying nonsmooth regularization methods (for example, the "Tibshirani lasso" [3]) to problems of eliminating uninformative variables when constructing mathematical models, where a necessary element of the technology is rapidly converging nonsmooth optimization methods applicable to minimizing nonsmooth nonconvex functions. In this section, we give several examples of approximation by artificial neural networks (ANNs) using nonsmooth regularization to remove uninformative neurons. To assess the quality of this approximation technology, the obtained approximation results are compared with previously known results. In each of the examples, the effectiveness of the presented nonsmooth optimization methods is studied.
Consider the approximation problem

min_w S(D, w) + α R_i(w),  S(D, w) = Σ_{i=1}^{N} (y_i − f(x_i, w))²,  (64)

where D = {x_i, y_i}, i = 1, . . . , N are observational data, R_i(w) are different kinds of regularizers, α is the regularization parameter, f(x, w) is an approximating function, x ∈ R^p is a data vector, w ∈ R^n is a vector of tunable parameters, and p and n are their dimensions. Formulas (1)–(3) can be used as regularizers.
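Assuming the least-squares form of criterion (64), the regularized objective can be sketched as follows; the lasso (L1) and smooth ridge (L2) regularizers are assumed here as stand-ins for the paper's formulas (1)–(3), which are not reproduced in this extract.

```python
import numpy as np

def objective(w, X, y, model, alpha, reg="l1"):
    """Regularized least-squares criterion of the general form (64):
    sum of squared residuals plus alpha * R(w), with R the lasso (L1)
    or ridge (L2) regularizer (assumed variants)."""
    resid = y - np.array([model(x, w) for x in X])
    R = np.sum(np.abs(w)) if reg == "l1" else np.sum(w**2)
    return float(resid @ resid + alpha * R)
```

With the L1 regularizer, the objective is nonsmooth in w, which is exactly the setting in which the subgradient methods of the previous sections are applied.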
Suppose that in the problem of approximation by a feedforward network, it is required to train, using the data D, a two-layer sigmoidal neural network of the form (65) with neurons of the form (66) (i.e., to evaluate its unknown parameters w). Here, x_j are the components of vector x ∈ R^p; w = ((w^{(2)}_i, i = 0, . . . , m), (w^{(1)}_{ij}, j = 0, . . . , p, i = 1, . . . , m)) is the set of parameters, the total number of which is denoted by n; ϕ(s) is a neuron activation function; and m is the number of neurons. The unknown parameters w are estimated by the least squares method (64) using one of the regularizers R_i(w). To solve problem (64), we use subgradient methods.
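A minimal forward pass for a network of this general form is sketched below, under the assumption of a logistic activation ϕ(s) = 1/(1 + e^{−s}); the paper's exact activation and parameter layout are not specified in this extract.

```python
import numpy as np

def sigmoid_net(x, W1, w2):
    """Two-layer sigmoidal network of the general form described:
    f(x) = w2[0] + sum_i w2[i] * phi(W1[i,0] + sum_j W1[i,j+1] * x[j]),
    with phi the logistic function (an assumption of this sketch).
    W1 has shape (m, p+1): column 0 holds the neuron biases."""
    phi = lambda s: 1.0 / (1.0 + np.exp(-s))
    hidden = phi(W1[:, 0] + W1[:, 1:] @ x)
    return float(w2[0] + w2[1:] @ hidden)
```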
In a radial basis function (RBF) network, we use the representation of a neuron given by (67), where x_j are the components of vector x ∈ R^p, and the network parameters are defined analogously, including the centers (c_i, i = 1, . . . , m). One of the goals of our study is to compare the effectiveness of subgradient methods in solving the problem of approximation by a two-layer sigmoidal ANN under conditions of reducing the number of excess neurons using various regularization functionals. To assess the quality of the solution, we use the root-mean-square error S = Σ_i (y_i − f(x_i, w))²/N on a test sample DT_{10,000} of 10,000 data points uniformly distributed in Ω.
In the algorithm we use (Algorithm 6), at the initial stage, an approximation of the ANN is found with fixed positions of the neurons' working areas, using the specified centers c_i ∈ R^p, i = 1, 2, . . . , m, in the approximation area defined by the data. By the working area of a neuron, we mean the area of significant change of the neuron activation function. The need for fixation arises because the working areas of neurons may be displaced outside the data area, in which case the neuron becomes a constant within the data area. For the RBF networks (65) and (67), fixation is easy, since the center parameters are present explicitly in expression (67). For sigmoidal networks (65) and (66), instead of (66), the following expression is used, in which the components of vector w do not contain bias terms; in this case, the center c_i lies on the central hyperplane of the working band of the sigmoidal neuron. Centers c_i inside the data area can be found by some clustering algorithm on the data x_i ∈ R^p, i = 1, . . . , N, which ensures that the neurons are located in areas with high data density. We use the maximin algorithm [45], in which the two data points that are maximally distant from each other are selected as the first two centers; each new center is the data point x_i whose distance to the nearest already chosen center is maximal. The resulting centers are mainly located at the edges of the data area. Computational experience shows that the use of the k-means method turns out to be ineffective, or effective only with a small number of iterations.
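The maximin center-selection procedure described above can be sketched directly:

```python
import numpy as np

def maximin_centers(X, m):
    """Maximin algorithm as described in the text: the first two centers
    are the two data points at maximal mutual distance; each next center
    is the point whose distance to its nearest chosen center is largest."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(D), D.shape)
    idx = [int(i), int(j)]
    while len(idx) < m:
        nearest = D[:, idx].min(axis=1)     # distance to closest center
        idx.append(int(np.argmax(nearest)))
    return X[idx]
```

As the text notes, the selected centers tend to lie at the edges of the data area, since boundary points are farthest from their neighbors.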

Algorithm 6
Training Algorithm. 1: On the data D, using the maximin algorithm, form the centers c_i ∈ R^p, i = 1, 2, . . . , m, where m is the initial number of neurons. Set the regularization parameter α and the type of regularizer R_i(w). Using a generator of uniformly distributed random numbers, determine the initial parameters of the ANN for each neuron. After solving problem (64) with fixed centers, it is necessary to return to the original description of the sigmoidal network in the forms (65) and (66); this can be done by forming the bias term of each neuron while leaving the other parameters unchanged. Such an algorithm for finding the initial approximation of the sigmoidal ANN guarantees that the data area is covered by the working areas of the neurons.
Here is a brief description of the algorithm for constructing an ANN. The algorithm first finds the initial approximation for fixed working areas of the neurons and then iterates the removal of insignificant neurons, each removal being followed by training of the trimmed network.
With a limited number of data, ANN f (x, W k ) with a number of parameters n not exceeding N and the smallest value of S k = S(D, f k ) is selected as the final approximation model.
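The pruning loop can be sketched as follows. The criterion for an "insignificant" neuron (smallest output weight, driven toward zero by the nonsmooth regularizer) and the `train` callback are assumptions made for illustration; the paper's exact selection rule is not reproduced in this extract.

```python
import numpy as np

def prune_and_retrain(w2, train):
    """Hypothetical sketch of the pruning loop: repeatedly drop the neuron
    with the smallest output weight |w2[i]| and retrain the trimmed network
    via the user-supplied `train` callback, collecting each trimmed model."""
    models = []
    w = np.asarray(w2, dtype=float).copy()
    while w.size > 1:
        k = int(np.argmin(np.abs(w[1:]))) + 1   # skip the bias w[0]
        w = np.delete(w, k)                      # remove insignificant neuron
        w = train(w)                             # retrain remaining weights
        models.append(w.copy())
    return models
```

The final model is then chosen among the collected candidates, e.g., the one with at most N parameters and the smallest index S_k, as stated above.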
Consider examples of solving approximation problems. Tables 8–10 show the value of S(DT_{10,000}, f) calculated during the operation of the network learning algorithm with the sequential removal of insignificant neurons after network training at step 3.3. The first row of each table contains the function f_i(x) to be approximated, the initial number of neurons m_0, the number of training data N_0, the type of regularizer, the regularization parameter α, and the index displayed in the rows. The first two columns indicate the number of neurons and the number of ANN parameters. The remaining columns show the values of the index for the tested methods. The values of the index with the limiting order of accuracy are highlighted in bold; this allows one to see, for each of the presented minimization algorithms, the segment over which high network quality is maintained as the number of neurons decreases. For some of the tables, the last row contains the minimum value of the maximum network deviation on the test data, ∆_min, for the constructed models. The dimensions of problems where the number of model variables exceeds the number of data are underlined. Standard least squares practice recommends having more data than parameters to be estimated; good values of the index in this case emphasize the role of regularization in approximation by neural networks.
In [49], in the domain Ω = [−1, 1] × [−1, 1], the function f_6(x) = sin(πx_1²) sin(2πx_2)/2 was approximated by the cascade correlation method using N = 500 uniformly distributed data points. The maximum deviation of the ANN obtained in [49] with m = 41 neurons on a test sample of 1000 data points was ∆ ≈ 0.15. Such a result is practically very difficult to obtain using the standard ANN learning apparatus; the authors of [49] did not succeed in obtaining it without first fixing the positions of the working areas of the neurons at the initial stage of the approximation algorithm. In our work, we obtained a series of ANN models with a smaller number of network training data, N = 150, and with an assessment of the results on a test sample DT_{10,000} consisting of 10,000 data points, with good approximation quality (∆ < 0.09 for the selected index values). For example, for some of the constructed models, ∆ ≈ 0.02, which is almost an order of magnitude less than the result from [49]. Table 8 shows the results of the approximation of the function f_6(x) using the smooth regularizer R_2(w). The quasi-Newtonian method (QN) was also used here. Using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons. The QN method is inferior in approximation quality to the subgradient methods. Note that in some cases, the number of network parameters exceeds the number of data; at the same time, the network quality index is no worse than in the area with m < 38. For the methods r_OM(α) and QN, the best indexes are in the area m > 37. Table 8. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_2(w), m_0 = 50, N_0 = 150, α = 10^{−7}, S(DT_{10,000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
Table 9 shows the results of approximating function f_6(x) by the sigmoidal ANN using the nonsmooth regularizer R_1(w) (the "Tibshirani lasso" technique [3]). Here, the trend in the relative efficiency of the methods continues: using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons. Table 9. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_1(w), m_0 = 50, N_0 = 150, α = 10^{−7}, S(DT_{10,000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined. Table 10 shows the results of approximating function f_6(x) by the sigmoidal ANN using the nonsmooth regularizer R_γ(w). Using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons. Thus, when a nonsmooth regularizer is used, it is possible to obtain an ANN with good approximation quality characteristics with a smaller number of neurons.
Based on the results of Tables 8-10, it can be concluded that the use of regularizers makes it possible to obtain a qualitative approximation in the case when the number of parameters of the neural network function exceeds the number of data.
In [49], on N = 625 data points formed in the domain Ω = [−3, 3] × [−3, 3] by a generator of uniform random numbers, the function f_3 was approximated. The maximum deviation of the ANN constructed in [49], based on RBF, on a test sample of 1000 data points was ∆_{1000} = 0.06. Function f_3 is a typical example of a function convenient for approximation by a radial basis network. In this work, we obtained a series of RBF-based ANN models with a smaller number of network training data, N = 150, and with an assessment of the results on the test sample DT_{10,000} consisting of 10,000 data points, with good approximation quality. For example, several of the constructed models give a value that is an order of magnitude smaller: ∆_{10,000} = 0.0024 (see Table 11). Table 11 shows the value of the index S(DT_{10,000}, f) calculated during the operation of the network learning algorithm with the sequential removal of insignificant neurons after training the network at step 3.3. The initial number of neurons is 36. The first two columns indicate the number of neurons and the number of ANN parameters. The last row of the table shows the maximum deviation of the network on the test data, ∆_min, for the constructed models. On this function, the methods RA(α_k), RA(α = const) and RSD turned out to be equivalent in quality of approximation. Table 10. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_γ(w), m_0 = 50, N_0 = 150, α = 10^{−7}, S(DT_{10,000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined. In [50], a function was approximated with N = 100 data points; the achieved value of the root-mean-square error on the training sample was S(D_{100}, f) = 10^{−6} [50]. We have built a number of sigmoidal ANNs with a several orders of magnitude lower value of the quality index on a test sample.
The values of the index S(D_{10,000}, f) on the test sample, depending on the number of neurons in the ANN, are given in Table 12. Here, as earlier, the algorithms RA(α_k) and RA(α = const) manage to obtain a longer series of ANN models with good approximation quality. In this section, the ANN training technology was presented, in which nonsmooth optimization methods are an integral component. A specific feature of ANN approximation problems is the lack of convexity of the minimized functions. The fastest methods, RA(α_k) and RA(α = const), turn out to be more effective in solving ANN approximation problems and make it possible to obtain models with a smaller number of neurons. Nevertheless, the experience of solving similar approximation problems suggests that, when solving an applied problem, it is better to have several alternative minimization methods to choose from.

Conclusions
The problem addressed was to construct a rapidly converging algorithm for finding the descent direction in a minimization method: a direction that forms an obtuse angle with all subgradients of some neighborhood of the current minimum, which form a separable set. Minimization along such a direction allows the algorithm to leave this neighborhood. This is the problem of constructing a hyperplane separating the origin from a separable set, whose normal is the desired direction. As a result, we obtain the problem of solving a system of inequalities, to which learning algorithms can be applied, for example, the perceptron learning algorithm [45].
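A minimal sketch of the perceptron route mentioned above: given a finite sample G of subgradients, find s with ⟨s, g⟩ > 0 for every g, i.e., a hyperplane through the origin separating the origin from the set. The function name and the tiny test set are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def perceptron_direction(G, max_iter=1000):
    """Find s with <s, g> > 0 for all rows g of G (a hyperplane separating
    the origin from the set G) via the classical perceptron update s += g."""
    s = np.zeros(G.shape[1])
    for _ in range(max_iter):
        violated = [g for g in G if s @ g <= 0]
        if not violated:
            return s
        s = s + violated[0]   # perceptron correction on a violated inequality
    raise RuntimeError("no separating direction found within max_iter")

# Subgradients clustered around (1, 2): a separable set
G = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
s = perceptron_direction(G)
```

The perceptron converges in finitely many corrections on a separable set, but slowly when the separation margin is small, which motivates the faster space-dilation scheme developed in this paper.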
Formalization of the model of subgradient sets made it possible to reduce the problem of solving a system of inequalities to an approximation problem, for whose solution an algorithm with space dilation was proposed that is conceptually close to the iterative least squares (ILS) method. Taking the specific features of subgradient sets into account improves the standard ILS scheme and yields an effective, rapidly converging iterative method for finding the descent direction in the minimization method, based on the subgradients obtained during the one-dimensional search.
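The ILS connection can be illustrated as follows: the inequalities ⟨s, gᵢ⟩ > 0 are replaced by the approximation problem ⟨s, gᵢ⟩ ≈ 1, which a recursive least squares pass solves with rank-one (Sherman–Morrison) updates of a matrix H, reminiscent of metric updates under space dilation. This is a generic RLS sketch under that assumption, not the paper's specific dilation operator.

```python
import numpy as np

def rls_direction(G, reg=1e3):
    """One recursive-least-squares pass for the approximation problem
    <s, g_i> ~ 1, i.e., a ridge solution of min ||G s - 1||^2."""
    n = G.shape[1]
    H = reg * np.eye(n)          # initial inverse-matrix estimate (weak prior)
    s = np.zeros(n)
    for g in G:
        Hg = H @ g
        H -= np.outer(Hg, Hg) / (1.0 + g @ Hg)   # Sherman-Morrison rank-one update
        s += (H @ g) * (1.0 - s @ g)             # correct the residual of <s, g> = 1
    return s

G = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
s = rls_direction(G)
```

On this separable toy set the returned s satisfies all inequalities with products near 1, illustrating why an approximation formulation can replace the original system of inequalities.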
The new algorithm for solving systems of inequalities was theoretically substantiated, and an estimate of its convergence rate was obtained as a function of the parameters of the subgradient set. On this basis, a new relaxation subgradient minimization method was developed and justified. On quadratic functions, the proposed method has the properties of the conjugate gradient method. The outlined approach to creating learning algorithms can be used to develop new learning algorithms with space dilation for relaxation subgradient minimization.
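For reference, the conjugate gradient property invoked here is the standard finite-termination fact for quadratics (textbook material, stated only to fix terminology):

```latex
f(x) = \tfrac{1}{2}\, x^{T} A x - b^{T} x, \qquad A = A^{T} \succ 0 ,
```

a method whose successive descent directions $s_{k}$ are $A$-conjugate,

```latex
s_{i}^{T} A\, s_{j} = 0, \qquad i \neq j ,
```

and which performs exact one-dimensional minimization along each $s_{k}$, reaches the minimizer $x^{*} = A^{-1} b$ in at most $n$ steps in $\mathbb{R}^{n}$.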
A practically implementable version of the minimization algorithm was developed that uses a rough one-dimensional search. The computational experiment performed on complex high-dimensional functions confirms the effectiveness of the proposed relaxation subgradient minimization method.
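The idea of a rough one-dimensional search can be sketched on a plain subgradient iteration: the step along the current direction is simply halved until the function value decreases, with no accurate minimization. The test function, step rule, and names are illustrative assumptions; the paper's method additionally builds its descent direction from accumulated subgradients, which this simplified sketch omits.

```python
import numpy as np

def f(x):
    # Nonsmooth convex test function f(x) = |x_0| + 2|x_1|
    return abs(x[0]) + 2 * abs(x[1])

def a_subgradient(x):
    # One valid subgradient of f at x (any element of the subdifferential works)
    g0 = 1.0 if x[0] >= 0 else -1.0
    g1 = 2.0 if x[1] >= 0 else -2.0
    return np.array([g0, g1])

def minimize_rough(x, step=1.0, shrink=0.5, iters=100):
    """Descent along the anti-subgradient with a rough one-dimensional search:
    the trial step is halved until a decrease of f is found."""
    for _ in range(iters):
        g = a_subgradient(x)
        t = step
        while f(x - t * g) >= f(x) and t > 1e-12:
            t *= shrink                      # rough, inexact line search
        if t <= 1e-12:
            break                            # no descent step found; stop
        x = x - t * g
    return x

x0 = np.array([3.0, -2.0])
xs = minimize_rough(x0)
```

On nonsmooth functions such a plain anti-subgradient iteration stalls once no single subgradient gives descent; this is exactly the situation the relaxation direction, obtuse to all nearby subgradients, is designed to escape.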
The applicability of the relaxation subgradient minimization method to nonsmooth non-convex optimization problems makes it possible to use it in neural network training, where insignificant variables or neurons must be removed by methods similar to the Tibshirani LASSO. Algorithms of this type are of great practical importance due to their high convergence rate and their applicability to minimizing non-convex functions, for example, when estimating the parameters of mathematical models under nonsmooth regularization used for model feature reduction [3,29,46]. The effectiveness of the proposed relaxation subgradient minimization method in one of these technologies has been demonstrated in the present work.
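As a closing illustration of the nonsmooth regularization setting, the sketch below forms a subgradient of a LASSO-type objective and runs a plain fixed-step subgradient iteration on a small sparse regression problem. The data, step size, and names are illustrative assumptions; in the paper this role is played by the proposed relaxation subgradient method, not this basic iteration.

```python
import numpy as np

def lasso_subgradient(w, X, y, gamma):
    """A subgradient of 0.5*||Xw - y||^2 + gamma*||w||_1.
    At w_i = 0 any value in [-gamma, gamma] is admissible; 0 is chosen here."""
    grad_smooth = X.T @ (X @ w - y)
    return grad_smooth + gamma * np.where(w != 0.0, np.sign(w), 0.0)

# Fixed-step subgradient iteration on a sparse regression problem
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])   # sparse ground truth
y = X @ w_true
w = np.zeros(5)
for _ in range(2000):
    w -= 0.005 * lasso_subgradient(w, X, y, gamma=0.1)
```

The L1 term drives the coefficients of insignificant features toward zero, which is the mechanism behind the variable and neuron removal discussed above.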