Reinforcement Procedure for Randomized Machine Learning

Abstract: This paper is devoted to problem-oriented reinforcement methods for the numerical implementation of Randomized Machine Learning. We develop a scheme of the reinforcement procedure based on the agent approach and Bellman's optimality principle. This procedure ensures strictly monotonic properties of the sequence of local records in the iterative computational procedure of the learning process. We determine how the size of the neighborhood of the global minimum and the probability of reaching it depend on the parameters of the algorithm. The convergence of the algorithm to this neighborhood with the indicated probability is proved.


Introduction
The beginning of this century has been marked by an increased interest in the problems of reinforcement learning. The essence of this branch of machine learning is to train an object (model, algorithm, etc.) by interacting not with a teacher but with an environment, using the trial-and-error method with reward or penalty depending on the results.
Let us look at this idea, abstracting from the specifics of the experiment, exclusively from the methodological point of view. Clearly, it represents a virtual game procedure where the game is simulated by two player-agents, their strategies, and quantitative assessments of their payoffs and losses. Reflecting on the peculiarities of learning processes, F. Rosenblatt, the author of the perceptron, introduced the concept of learning without a teacher and classified the types of structural tuning for playing automata [1].
The same concept can be traced in the paper [2] by I.M. Gelfand, I.I. Pyatetskij-Shapiro, and M.L. Tsetlin. The authors proposed a mathematical model of a game between two automata with a variable structure changing in the course of interaction with the environment. The interaction results were characterized by quantitative assessments.
Later, the response to the action of the "environment" was given a particular term, the so-called "reinforcement." It became a whole branch in the theory and applications of machine learning. Admittedly, early research focused on two problems, clustering (visualization) and pattern recognition. Such problems involve objects with quantitative characteristics (features), and, most importantly, the "distances" between them can be calculated. Rewards or penalties in the algorithm parameters were arranged based on the distance matrix. Neural networks were used as algorithms [3]. In particular, the so-called "Kohonen maps" were among the first research efforts in this area; for details, see [4]. In such maps, the weights of a neural network are adjusted using a game-theoretic model that implements the principle of competition between its nodes: at each step of the algorithm, an advantage is gained by the nodes with the minimum distance to the objects.
Subsequently, reinforcement learning was actively developed based on the automata models of an object (agent) interacting with the environment in game-theoretic terms (strategies, utility functions, and payoffs). It was presented to the scientific community as a certain model of human management [5].
Numerous algorithms appeared with different models and volumes of a priori information about the environment, different methods for choosing strategies, and different procedures for forming utility functions. For example, we refer to [6–8]. A fairly comprehensive survey of reinforcement learning methods was prepared at the Department of Mathematical Methods of Forecasting (Faculty of Computational Mathematics and Cybernetics, Moscow State University) [9].
The general structure of reinforcement learning procedures is interpreted in terms of a Markov decision process, an extremely general construction of one-step iterations in continuous time t with feedback, accompanied by a specific terminology [10]. Its main components are an agent model with an output (the agent's action) and inputs in the form of environment states and rewards, current or averaged over a certain number of iterations, and an environment model with an input (the agent's action) and outputs in the form of specified rewards and responses (environment states). The fundamental feature of this procedure is the empirical estimation of the conditional probabilities of rewards for the agent's actions based on adjustable random Monte Carlo simulations. Such simulations (also called iterations or trials) are used to average a fixed number of current rewards or discount them. The resulting function depends on the environment state and the agent's strategy and is taken as a utility function (an analog of the objective function in teacher-assisted learning procedures). During learning, this function is sequentially maximized [11,12] using Bellman's optimality principle [13] in its stochastic setting [14]. (Many researchers interpret reinforcement learning as learning without a teacher. Indeed, this approach involves no goal-setting in the form of a teacher's error-and-response function to be minimized. However, the corresponding role is played by experimentally generated utility and reward functions, which represent a virtual "teacher." The structures of these functions and the methods for calculating their mathematical expectations are based on experimental statistical material and expert opinions. Therefore, the results of using reinforcement learning often provoke discussions.)
In the papers [22,23] and the book [24], a new machine learning procedure (Randomized Machine Learning, RML) was developed. RML is based on the use of a parameterized model with random parameters, its optimization by the conditional information entropy maximization method, and the subsequent generation of random parameters with the optimized probability density functions. According to this concept, the procedure consists of three stages: analytical (determining the entropy-optimal probability density functions of the randomized model parameters and measurement noises consistent with the empirical balances with the data), computational (solving the empirical balance equations numerically), and experimental (performing Monte Carlo simulations to reproduce random sequences with the entropy-optimal probabilistic characteristics).
Because all machine learning problems incorporate intrinsic uncertainty in models and data, it was proposed to maximize the informational entropy of probability density functions (PDFs) of the model parameters and measurement noises as a measure of uncertainty subject to empirical balances with real data. This is a functional entropy-linear programming problem of the Lyapunov type [24]. It has an analytical solution, i.e., the optimal PDFs parameterized by Lagrange multipliers, which are determined from the empirical balance equations.
They are specific nonlinear equations containing the so-called integral components (multidimensional parametrized definite integrals). Therefore, it is impossible to establish any fruitful properties of the equations that would ensure the convergence of iterative procedures for their solution.
In this paper, we employ the GFS method based on Monte Carlo batch iterations [25,26]. The basic method GFS (Generation, Filtration, Selection) is an improved method for finding an approximate value of the global minimum on a unit cube, with an estimate of the size of the neighborhood and the probability of reaching it.
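Abstracting from the RML-specific residual, one iteration of the three GFS stages can be sketched as follows. Here `residual` and `is_admissible` are placeholder callables standing in for the residual functional and the admissible-region test; the batch layout is an assumption for illustration, not the authors' exact scheme.

```python
import numpy as np

def gfs_step(residual, is_admissible, d, M, rng):
    """One Generation-Filtration-Selection step on the unit cube in R^d."""
    # Generation: M independent points, uniform on [0, 1]^d.
    points = rng.random((M, d))
    # Filtration: keep only the "good" points in the admissible region.
    good = np.array([is_admissible(z) for z in points])
    candidates = points[good] if good.any() else points
    # Selection: the point with the smallest residual becomes the local record.
    values = np.array([residual(z) for z in candidates])
    best = np.argmin(values)
    return candidates[best], values[best]
```

For example, with `residual = lambda z: float(np.sum((z - 0.3) ** 2))` and an always-true admissibility test, a single step with a large packet already lands near the minimizer with high probability.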
A problem-oriented version of the reinforcement concept is being developed to fundamentally improve the computational properties of the GFS method and the RML procedure as a whole. We prove the theorem on the strict monotonic decrease of the residual function for a system of nonlinear equations of empirical balances in which only measurements of the values of the functions are available. The latter is used to study the convergence with probability 1 of an iterative procedure with reinforcement and to estimate the size of the neighborhood of the global minimum and the probability of reaching it with a finite number of iterations.
Therefore, our contribution to the theory and practice of RML is to develop a reinforcement scheme that increases the computational efficiency of the procedure, and to prove its convergence to the neighborhood of the global minimum with a certain probability.

The Mathematical Model of the RML Procedure
We study the problem of learning the model of dependence between one-dimensional input and output data.
where the measurement noises ξ[k] take values in an interval with given left and right boundaries. The probabilistic properties of the measurement noises are characterized by PDFs Q k (ξ[k]), k ∈ N . Suppose that they are continuously differentiable.
The mathematical model of the general dynamic dependence with finite memory ρ is described by a functional B [24]: where a = {a 1 , . . . , a m } are the parameters. If the functional B is linear and continuous, it can be represented by a segment of the Volterra functional power series [24].
In the equality above, the parameters a are random and interval-type: The probabilistic properties of the parameters are characterized by a PDF P(a), which is supposed to be continuously differentiable as well.
The output of the model is observed with additive noises: Because the model parameters are random and the measurements of the output are distorted by random noises, according to (1) and (4), ensembles of random trajectories of the observed output are generated. To form the morphological properties of the PDFs, we adopt numerical characteristics of the ensembles based on moments, called normalized total moments: where s is the degree of a moment, and S is the number of moments.
The numerical characteristics (5) are the values of the normalized total moments along the trajectories of the observed model output. Output data u (s) [k], s ∈ [1, S], k ∈ N are assumed to be similar indicators of some real process: In particular, such properties are inherent in trading procedures for options [27,28].
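The moment-based characteristics along the ensemble of output trajectories can be sketched as below. Since the exact normalization in (5) is not reproduced here, this is a hypothetical reading: plain sample s-th moments averaged over the Monte Carlo ensemble at each time step.

```python
import numpy as np

def empirical_moments(trajectories, S):
    """Sample s-th moments, s = 1..S, of an ensemble of model-output
    trajectories at each time step.  A hypothetical reading of the
    normalized total moments (5); the exact normalization may differ."""
    traj = np.asarray(trajectories, dtype=float)  # shape (runs, N)
    return [np.mean(traj ** s, axis=0) for s in range(1, S + 1)]
```

Each returned array has length N and plays the role of one row of the empirical characteristics compared with the data u(s)[k].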
In this case, the basic RML algorithm [24] has the following form: with the empirical balance conditions Problem (8)-(10) has an analytical solution parameterized by Lagrange multipliers Λ = [λ s,k | s = 1, S; k ∈ N ]: The Lagrange multipliers figuring in these equations satisfy the empirical balance equations These equations contain the so-called integral components, namely, definite parameterized multidimensional integrals over the m-dimensional parallelepipeds A (3). In general, only the values of the functions in which they appear can be determined numerically. The latter excludes the possibility of reasonably establishing the properties of the functions on the left-hand sides of these equations.

The Adaptive Method of Monte Carlo Packet Iterations with Reinforcement (the GFS-RF Algorithm)
To solve these equations, the GFS algorithm was proposed in [26]. It is a modification of the random search method consisting of the generation (G) of a number M i of random and independent points on the unit cube in R m at each iteration step i, the filtration (F) of the "good" points (those falling into the admissible region), and their selection (S) according to the values of the residual functional adopted for these equations. The convergence properties of GFS relied on the existence of certain functional properties of the functions involved in these equations. Below, we fundamentally modify this algorithm using the ideas of reinforcement.

The Canonical Form of the Problem
The system of equations (13) can be represented in the following form: where Λ = [λ s,k | s ∈ [1, S], k ∈ N ] is the matrix of Lagrange multipliers, and the functions Using the vectorization procedure [29], Equations (14) and (15) can be written as where the vector function φ, the variable λ, and the zero vector on the right-hand side have the dimension d = S × (N + 1). The vector λ ∈ R d , i.e., its components take values −∞ < λ n < ∞.
We reduce problem (16) to the canonical form using the following change of variables: where b n is a parameter. This mapping takes the infinite interval (−∞, ∞) to the interval [0, 1].
As a result, Equation (16) takes the form We introduce the residual function (the Euclidean norm) Solving Equation (18) is equivalent to finding the points z * at which the global minimum of the residual function J(z) is attained. This interpretation turns out to be fruitful since the global minimum is known: Thus, solving Equation (18) reduces to finding the global minimum of a continuous function that is bounded below and whose values are algorithmically computable on the unit cube: Because the function J(z) is continuous and z ∈ Z d + , there exist a modulus of continuity ω(h) and positive constants H and p: where the constants H and p are unknown. In order to use these constants to study the properties of the iterative process, we have to estimate them using only the values of the residual function.
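Condition (22) is invoked later as a Hölder-type bound whose constants are estimated by least squares. Since the equation itself is not reproduced above, a plausible form consistent with that later use is stated here as an assumption:

```latex
% A Hölder-type modulus-of-continuity condition; the exact form of (22)
% is assumed, not reproduced from the source.
|J(z') - J(z'')| \le \omega(\|z' - z''\|) \le H \, \|z' - z''\|^{p},
\qquad \|z' - z''\| \le h, \quad z', z'' \in Z_{+}^{d}.
```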

Structure of Reinforcement Procedure
Let us introduce a useful terminological framework. The function J(z) is treated as an environment and its values J k (z (k) ) on iteration k are responses to the agent's strategy (action) z (k) . The quality of the environment response is assessed by a utility function Q(J), whose values on iteration k are Q k (J k ). The quality of the agent's actions (strategies) is characterized by a payoff function κ(Q).
The self-learning algorithm minimizing the residual function (21) based on Monte Carlo packet iterations has the following reinforcement scheme. Note that this algorithm enumerates in a controlled way the values of the residual function on the unit cube. Therefore, the reinforcement scheme is focused on learning rational controllability to accelerate the iterative process.
Agent. The agent's strategy on iteration k is to generate a packet of uniformly distributed random vectors on the unit cube. The strategy is characterized by the grid step η k and the number M k of random values for each component of the vector z from the interval [0, 1]. They are related by where q is a fixed parameter. Due to this relation, let the agent's strategy be the value M k . On a given grid step, it is possible to generate a different number N k of independent random vectors (agent's strategies) with the uniform distribution on the cube Z d + : Assume that in this packet (as has been emphasized, we employ simple Monte Carlo simulations: the same number of independent random numbers with the uniform distribution on [0, 1] is generated for each coordinate of the original space), For each pair of the (k − 1)th and kth packets, the corresponding (k − 1, k)-records and the decrements are calculated by the formulas and respectively.
Utility function. The performance of the iterative process is characterized by the values of the decrements. Because the iterative process involves Monte Carlo simulations, the values u k are random. To obtain more reliable trend indicators of the iterative process, we organize m simple Monte Carlo simulations with M k (23) trials on each iteration k and compute the mean values ū k (M k ): To describe the state of the iterative process, we adopt the concept of exponential comparative utility [30,31]. In this context, the utility function ϕ(ū k (M k )) is assumed continuously differentiable, positive, and monotonically decreasing in the variable ū k : Following [30,31], we choose the exponential comparative utility function where η > 0 and γ > 0 are some parameters.
Payoff function. In the concept of reinforcement, the payoff function reflects the dependence of the payoff r k on the utility ϕ k,k−1 (M k ). By assumption, the payoff grows with increasing exponential comparative utility.
Therefore, the payoff function satisfies the condition r k (ϕ k,k−1 | M k ) > 0 and is monotonically increasing in the variable ϕ k,k−1 , i.e., The reinforcement decision is taken after accumulating a given number L of the payoffs Q k by iteration k, i.e., the mean payoff Q̄ k (M k ) over L iterations: The value Q̄ k (M k ) is an important characteristic of the reinforcement procedure; it is used to optimize the main parameter of the Monte Carlo trials, the number of random points required at the (k + 1)th iteration step.
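The decrement averaging and the exponential comparative utility can be sketched as follows. The placement of the parameters η and γ in the exponent is an assumption, since formula (29) is not reproduced above; what matters is positivity and monotonic decrease in the mean decrement.

```python
import math

def mean_decrement(records_prev, records_curr):
    """Average decrement u_bar over m Monte Carlo repetitions; the inputs
    are the local records J*_{k-1} and J*_k from each repetition."""
    decs = [c - p for p, c in zip(records_prev, records_curr)]
    return sum(decs) / len(decs)

def exp_utility(u_bar, eta=1.0, gamma=1.0):
    """Exponential comparative utility: positive and monotonically
    decreasing in u_bar.  The exact role of eta and gamma in (29) is
    an assumption here."""
    return eta * math.exp(-gamma * u_bar)
```

A large negative mean decrement (a strong improvement of the record) thus yields a large utility, which the payoff function then rewards.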

Formation of the Monotonic Sequences of Records
Following the concept of reinforcement, we use a Markov iterative process for RML; the state of this process on iteration (k + 1) depends only on the state on the previous iteration k. To find the agent's optimal strategy (the value M k+1 ), we use Bellman's optimality principle [13]. In its extended interpretation, the agent's optimal strategy on iteration (k + 1) depends on the weighted optimal strategy on iteration k. This principle can be implemented in the additive (33) or multiplicative (34) form of the algorithm. Here, α and γ are some parameters.
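The two forms of the update can be sketched as below. The concrete weightings are assumptions for illustration, since (33) and (34) are not reproduced above; both tie the next packet size M_{k+1} to the previous one and to the optimized payoff-driven component.

```python
def next_strategy_additive(M_k, dM_k, alpha):
    """Additive form (33), sketched: weighted previous strategy plus an
    optimized increment dM_k; the exact weighting is an assumption."""
    return max(1, round(alpha * M_k + dM_k))

def next_strategy_multiplicative(M_k, gamma, Q_bar):
    """Multiplicative form (34), sketched: scale the previous strategy by
    a factor driven by the mean payoff Q_bar; hypothetical concrete form."""
    return max(1, round(M_k * (1.0 + gamma * Q_bar)))
```

In either form, a higher accumulated payoff enlarges the next packet, concentrating the Monte Carlo effort where the record sequence is still improving.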

Remark 1.
Generally speaking, Bellman's optimality principle is only a declaration here, which may sometimes fail. In particular, learning processes and their internal mechanisms are underinvestigated, and they do not necessarily satisfy the Markov property. As a result, the agent's strategy on iteration (k + 1) can be formed from the weighted optimal strategies on iterations (k − s), . . . , k. For example, where α k−s , . . . , α k are some parameters.
The reinforcement procedure generates an optimal number M k+1 of random values for each iteration. The local record J * k+1 (M k+1 ) and the decrement u k+1 (M k+1 ) are then determined for the resulting value M k+1 . They are compared with their counterparts obtained on the previous iteration k. If the new record is smaller than the previous one, it becomes a member of the strictly monotonically decreasing sequence of local records. In this case, the sequence of decrements has strictly negative elements: Thus, the Reinforcement module has the logical diagram shown in Figure 1. The Agent is the central block of this diagram. It generates the number M k+1 of random values on iteration (k + 1) as the sum of the number of random values on the previous iteration k and the optimized component ∆M k with the parameter α. At each iterative step, the Optimization block outputs the maximum Q̄ k of the payoffs r k accumulated over L iterations (a fixed number), which are calculated in the Payoff function block. The necessary values of the comparative utility function ϕ k,k−1 , the records J * k and J * k−1 , and their decrements u k and u k−1 are calculated in the Feedback block.
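The interaction of the Agent, Feedback, Payoff function, and Optimization blocks can be condensed into one loop. This is a compact sketch under several stated assumptions: the payoff is taken equal to the exponential utility, the packet-size update uses an additive Bellman-type form, and only improving iterations contribute payoffs.

```python
import math
import random

def gfs_rf(residual, d, M0=64, iters=20, alpha=1.0, gamma=1.0, L=3, seed=0):
    """Sketch of the GFS-RF loop in Figure 1: generate a packet, select the
    local record, form the decrement and its exponential utility, accumulate
    payoffs over L iterations, and adapt the packet size M.  The concrete
    choices (payoff = utility, additive update) are assumptions."""
    rng = random.Random(seed)
    M, record, payoffs = M0, float("inf"), []
    for k in range(iters):
        packet = [[rng.random() for _ in range(d)] for _ in range(M)]
        cand = min(map(residual, packet))                # selection
        if cand < record:                                # keep records strictly decreasing
            u = cand - record if record < float("inf") else 0.0
            record = cand
            payoffs.append(math.exp(-gamma * u))         # exponential comparative utility
            if len(payoffs) >= L:                        # reinforcement decision
                Q_bar = sum(payoffs[-L:]) / L
                M = max(1, round(alpha * M + Q_bar))     # additive Bellman-type update
    return record, M
```

Run on a simple quadratic residual, the loop drives the record into a small neighborhood of the global minimum while the packet size adapts upward.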
Thus, the described reinforcement procedure proves the following assertion. Let, at each step of the iterative process of finding a solution to Equations (14) and (16), the reinforcement procedure (30)-(32) be carried out, implementing Bellman's optimality principle in the form (33) or (34).
Then, a strictly monotonically decreasing sequence of local records is generated: Note that the sequence of local records consists of random elements but satisfies the chain of inequalities (36).

The Probabilistic Characteristics of the Packet Z k
The iterative procedure is based on generating the packet Z d k of random and independent vectors with a uniform distribution on the unit d-dimensional cube. The source of this packet is a random generator that produces on each iteration k a (d × M k )-dimensional array of independent random variables with a uniform distribution on the interval [0, 1].
Consider the d-dimensional unit cube Z d + and the grid with step η k (23). The cube Z d + is the union of elementary cubes with side M −q k . We estimate the probability P(M k , d, q) that each elementary cube contains at least one of the random vectors from the packet Z d k generated on iteration k: where, as M k → ∞.
Proof. Consider the partition of the interval [0, 1] by a grid with step η k (23). At least one of the M k = (1/η k ) 1/q random values falls into a given elementary interval with probability η k . Let this grid be applied to all sides of the unit cube. Then a single random vector from the packet of N k = M d k vectors falls into a given elementary cube with probability η d k . Hence, the complementary event (no vector falls into the elementary cube) has probability (1 − η d k ) N k . The upper bounds on the number of elementary intervals and the number of elementary cubes are (1 + M −q/2 k ) and (1 + M −q/2 k ) d , respectively. Therefore, the upper bound on the probability of the event A (every elementary cube is hit) is given by Due to the relation (25) between M k and N k , we finally arrive at the upper bound (38).
For large values M k = x, this yields (39).
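The bound of Lemma 1 can be evaluated numerically following the counting argument of the proof. Where the source text is garbled, the cube count is taken as ceil(M**q)**d, which is an assumption.

```python
import math

def coverage_probability(M, d, q):
    """Lower bound, following Lemma 1's counting argument, on the
    probability that every elementary cube of side M**(-q) contains at
    least one of the N = M**d packet vectors.  The cube count
    ceil(M**q)**d is an assumption."""
    eta = M ** (-q)                          # grid step eta_k
    N = M ** d                               # packet size N_k = M_k**d
    p_one = 1.0 - (1.0 - eta ** d) ** N      # a fixed cube is hit
    n_cubes = math.ceil(M ** q) ** d         # number of elementary cubes
    return p_one ** n_cubes
```

For moderate q the packet of M**d vectors is far larger than the number of cubes, so the bound approaches 1 rapidly as M grows, in line with (39).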

The Probabilistic Properties of the Local Record Sequence (36)
The reinforcement procedure forms the strictly monotonically decreasing sequence J * of local records and the sequence of their arguments z * . Because of their strict monotonic decrease, it is more convenient to renumber the elements by integers 1, 2, . . . , i, . . . : Let Z denote the set of points z 0 corresponding to the zero value of the residual function: J(z 0 ) = J * = 0 (20). Due to the continuity of the function J(z), this set is compact. We introduce the distance between an arbitrary point in the cube and the set Z: The elements of the local record sequence are ordered but random values. Therefore, the deviation from the global record (the global minimum) takes a random value J * i on each iteration. Using the assumption that the residual function (19) has a modulus of continuity ω(H, h) (22), we can formulate the following Lemma 2.

Lemma 2.
For a finite number of iterative steps i, with a probability not smaller than P 0 (M i , d, q) (38) and (39), we have the bilateral estimate where ω(H, h i ) denotes the modulus of continuity of the function J(z) (22), and h i =
Proof. Consider the random points generated on iteration i; among them, let ẑ be the closest one to the set Z in terms of the distance (42). By Lemma 1, at least one of these points falls, with a probability not smaller than P 0 (M i , d, q), into each elementary cube with side M −q i . Hence, This happens if the point z 0 corresponding to the zero value of the residual function lies at the center of an elementary cube with side M −q i and its nearest random points are at the cube vertices, so that each cube contains at least one random point.
By the Hölder condition (22), we have On the other hand, This chain of inequalities implies From (45) and (47) it follows that with a probability not smaller than P 0 (M i , d, q).
Inequality (48) provides an upper bound on the deviation from the zero value of the residual function on each iteration obtained after the reinforcement procedure and a lower bound on the probability P 0 (M i , d, q) (38) of its realization. The upper bound is the value of the modulus of continuity of the function J on these iterations. In other words, according to (22), With the notations we arrive at a very useful probabilistic form of inequality (49): It gives a lower bound on the probability that the current record will fall into the neighborhood of the global minimum as well as determines its size.

The Size of the Neighborhood of the Global Minimum
Consider a sequence of decrements over a finite number of iterations k: We represent the decrements as Due to (51), where, due to (33), The boundary value of the modulus of continuity of the decrement for k iterations is or, in the logarithmic scale, Thus, we have a linear dependence with unknown parameters log D and p, which are related to the parameters of the modulus of continuity (22). Their values can be estimated from the available data on log u * k and log M k by the least squares method. The parameters D and p determine the size of the neighborhood of the global minimum and the probability of reaching it, (50) and (51).
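The least-squares step can be sketched as below. The fitted relation is assumed to be log u*_k = log D − p log M_k, with positive values u*_k (e.g., absolute decrements); whether the grid exponent q is absorbed into p here is also an assumption.

```python
import math
import numpy as np

def fit_modulus(u_star, M):
    """Ordinary least squares fit of log u*_k = log D - p log M_k over the
    observed record data; returns (D, p).  The exact form of (57) is
    assumed, and u_star must be positive."""
    x = np.log(np.asarray(M, dtype=float))
    y = np.log(np.asarray(u_star, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)   # y ~ slope*x + intercept
    return math.exp(intercept), -slope        # D = e^intercept, p = -slope
```

On synthetic data generated exactly from the model, the fit recovers D and p to numerical precision, which is a useful sanity check before applying it to the noisy record sequence.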

Remark 2.
The upper bound (54) is very conservative: it focuses on estimating the elements of the local record sequence and neglects an essential feature of the decrement sequence. In the latter, the number of random values on iteration (k + 1) changes compared to iteration k due to the reinforcement procedure (33) and (34).
This feature is reflected in the expression for the boundary value of the decrement: where the reinforcement procedure (34) generates the values By analogy with (57), we obtain This dependence still has the two parameters D and p, but the data include log u * k and M k , together with the values M̃ k additionally generated by the reinforcement procedure. The dependence (60) is nonlinear; its parameters can also be recovered by the least squares method. As in the previous case, however, there is no guarantee of obtaining the optimal result.

The Convergence of the GFS-RF Algorithm to the Global Minimum
The reinforcement procedure (30)-(33), combined with the selection of local records, endows their sequence with the property of strict monotonic decrease (37), accompanied by a sequence of decrements with negative elements (36). Based on these facts, we can formulate the following Theorem 1.
Then the sequence of local records J * k = {J * 1 > J * 2 > · · · > J * k } at k iterations reaches the region R * k with a probability not less than Proof. The proof follows from Lemmas 1 and 2 and the estimate (51).

Discussion and Conclusions
The concept and computational procedure of Randomized Machine Learning proposed in [22] have turned out to be very useful for estimating probability distributions from inaccurate data and have provided an effective computer technique for solving many applied problems [24]. The modules of this procedure have been applied to practical problems: the randomized forecasting of the World population [32] and of the electrical load in power systems [33], the evolution of thermokarst lakes in the Arctic zone [34], and the randomized classification of objects [35,36]. In these works, we used public datasets of the UN [37,38]. However, the practical application of RML involves the very difficult problem of solving a specific system of nonlinear equations in which only the values of the constituent functions are available.
In this paper, we propose using the idea of reinforcement to endow computational algorithms with adaptive properties. A problem-oriented reinforcement procedure based on the agent approach is proposed, in which the agent's strategy is the optimal number of random points generated at each step of the iterative process. As the utility function, the exponential comparative utility function is used, which depends on the average decrements of the local records achieved at each main iteration. An important role in the reinforcement procedure is played by the payoff function, which assigns "penalties" to the values of the utility function. The agent's strategy is optimized using R. Bellman's principle of optimality. As a result of applying the reinforcement procedure, the size of the neighborhood of the global minimum of the quadratic residual function and the probability of reaching it in a finite number of iterations are determined.