This paper is devoted to a novel formal analysis that optimizes the learning models for feedforward multilayer neural networks with hybrid structures. The proposed mathematical description takes the form of a specific switched-type optimal control problem (OCP). We develop an equivalent, optimal control-based formulation of the given problem of training a hybrid feedforward multilayer neural network, i.e., of learning the target mapping function constrained by the training samples. This novel formal approach makes it possible to apply well-established optimal control techniques to the design of a versatile class of fully connected neural networks. We next discuss why the necessary optimality conditions of Pontryagin type are not well suited to the obtained switched-type OCP. This fact motivates us to consider the so-called direct-solution approaches to switched OCPs, which can be associated with the learning of hybrid neural networks. Concretely, we consider the generalized reduced-gradient algorithm in the framework of the auxiliary switched OCP.
Although it has been almost forty years since the multilayer perceptron (MLP) was proposed [1], it is still widely used as a fully connected classifier alongside deep convolutional neural networks. MLPs constitute fully connected feedforward neural networks, and it has been proven that they are universal approximators [2,3]. The importance of MLPs can hardly be overstated, since they are key classifier models due to their expressive power guaranteed by the universal approximation theorem [3]. An important issue, however, is the generalization ability of a well-trained network [4]. During the 1990s, balancing the bias and variance of the network, which is achieved by an appropriate choice of the number of layers and the number of nodes, was an important research topic (see, e.g., [5] and references therein). This issue was mitigated by regularizing the connection weights.
However, the problem of optimally configuring the network architecture, i.e., determining the number of hidden layers and the number of units in each layer, has never been properly formalized. A generic MLP consists of a sequence of computational layers with the same computational model at every layer but with different numbers of hidden units. The connection weights between layers are the parameters of the model. A hybrid MLP with trainable parameters involves layers with different activation functions. Except for the general convention of using a larger number of computing nodes for layers closer to the input, there is no hypothesis on how the layers with different activation functions should be organized in order to obtain an optimal overall performance. Of course, various optimization criteria can be defined. Two basic optimization criteria are faster convergence and better generalization of the mapping function.
For our analysis, we assume that the number of layers T is fixed and that every layer has an equal number of units, also called nodes. A node can be disabled (made ineffective) by setting all of its connecting weights to zero. We can drop a layer by making it an identity function, for which the output is the same as the input. If we start with a sufficiently large T, the proposed algorithm can always converge to an optimal neural network. We always start with the generic model and update the network parameters to find an optimal configuration of the hybrid feedforward network for the task.
For a concrete solution to optimize the neural network configuration, i.e., the network architecture, we will use a specific optimal control methodology that was initially developed for the optimal design of a switched control system. A conventional MLP neural network can be interpreted as a discrete-time dynamic system [6,7] (and references therein). Despite the impressive results seen in the real-world applications, some basic formal aspects of the analysis of modern feedforward deep network models remain unsolved. This fact motivates the development of possible equivalent interpretations of generic neural network dynamics in the form of some known mathematical abstractions. This equivalent representation of the hybrid MLP neural network in the form of a discrete-time control system makes it possible to apply the switched optimal control methodology to the main optimal structuring problem mentioned above.
Generally, the optimal control approach is not adequate for optimizing feedforward deep neural networks, or their variants, as in [8]. In this paper, we reduce the optimal configuration problem for a hybrid multilayer neural network model to a specific switched-type OCP. We then discuss some critical aspects of the Pontryagin optimality principle in the framework of the resulting OCP and develop a reduced-gradient method for this problem.
The novelty of the proposed model is its hybrid computational architecture. In the proposed architecture, all layers are fully connected, but the activation functions at different layers may differ. They are automatically chosen from a set of possibilities to optimize the objective function defined in Equation (6). T is the number of layers, set initially and kept fixed. Refs. [6,7] proposed training neural networks using tools from optimal control. In those proposals, all layers use the same computational model, which greatly limits the flexibility of the model being optimized. In the hybrid network presented herein, although the number of layers is set to a fixed T, the computation q at different layers automatically converges to an optimum, chosen from a variety of Q possibilities. For example, the proposed model has the ability to converge to an optimal number of computational layers by forcing the transfer functions of redundant layers to be identity functions. The model can achieve a better configuration due to greater flexibility, facilitated by the combination of hybrid computations adopted at different layers. In addition, the objective function for optimization (Equation (6)) is designed to regularize the connection weights to ensure better generalization.
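To make the hybrid computational architecture concrete, the following minimal sketch (an illustrative NumPy fragment, not the authors' implementation; all names are hypothetical) shows a forward pass in which every layer is fully connected but its activation is selected from a finite family Q; choosing the identity activation together with unit weights and zero bias effectively removes a layer.

```python
import numpy as np

# Minimal sketch of a hybrid MLP forward pass (illustrative only).
# Each layer is fully connected; its activation q_t is drawn from a finite
# family Q. A layer collapses to a pass-through when q_t = "identity",
# W_t is the unit matrix, and b_t = 0, which effectively reduces the depth.
ACTIVATIONS = {
    "identity": lambda z: z,
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
}

def hybrid_forward(x, weights, biases, layer_types):
    """Propagate x through T layers; layer_types[t] selects q_t from Q."""
    for W, b, q in zip(weights, biases, layer_types):
        x = ACTIVATIONS[q](W @ x + b)
    return x

# Toy usage: T = 3 layers with an equal number of units per layer.
rng = np.random.default_rng(0)
n, T = 4, 3
weights = [rng.standard_normal((n, n)) for _ in range(T)]
biases = [np.zeros(n) for _ in range(T)]
weights[1], layer_types = np.eye(n), ["tanh", "identity", "relu"]  # "drop" layer 1
output = hybrid_forward(rng.standard_normal(n), weights, biases, layer_types)
```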
The remainder of this paper is organized as follows: Section 2 contains a formal mathematical model of a hybrid MLP neural network and includes its equivalent representation in the form of an OCP. In Section 3, we study the obtained MLP network dynamics in the context of general switched control systems and consider the corresponding switched-type OCP. The proposed approach involves a constructive description of the permutation of layers in the given hybrid MLP network. Section 4 includes a critical discussion of the applicability of the generic Pontryagin-type optimality conditions to a practical implementation of the resulting OCP. Motivated by this critical discussion, in Section 5, we develop an alternative approach to the switched OCP solution associated with the hybrid network optimization problem, i.e., the original optimal configuration problem for a hybrid MLP neural network. This approach involves a specific variant of the conventional reduced-gradient algorithm associated with the switched OCP. Section 6 summarizes the paper with indications of future extensions of the idea.
2. A Discrete-Time Optimal Control Framework for the Feedforward Hybrid Deep Learning Networks
In this section, we develop a mathematically rigorous OCP-based formalism for the optimal training problem associated with a hybrid fully connected neural network. By hybrid, we mean that the different layers can use different activation functions. This novel system-theoretic formalization provides an equivalent modelling approach for the main optimal configuration problem for a hybrid MLP neural network.
Recall that the training of a fully connected feedforward neural network is often considered in a nonlinear programming framework [7]. In this case, one solves a suitable constrained or unconstrained optimization problem [9]. In supervised learning, the key idea is to find a classifier by estimating the parameters of a known parametric function (model) constrained by a supervised dataset. The formal procedure of finding the optimal parameters in the above setting is usually considered as a generalized regression problem.
An artificial neural network consists of processing units (also called nodes or neurons) that can be described in terms of system inputs and outputs. Therefore, a suitable formal description of a multilayered neural network naturally leads to an equivalent discrete-time dynamic system, where training imparts the dynamic behaviour and convergence to an optimal state is the goal. Consequently, the main nonlinear optimization problem of training a feedforward deep neural network can also be studied in the context of a specific constrained OCP, although the convergence mechanisms are different. Neural network training relies on gradient-descent algorithms, where the rate of convergence is controlled by the “learning rate”; in the OCP setting, convergence is controlled by changing the discrete step size and can be faster.
Let T be the total number of layers in the given hybrid neural network and S be the maximal sample size of the dataset. Given a collection of hybrid network layers, with the inputs being
we obtain the generated network outputs, either real (as a regression problem) or integer (as a classification problem), according to the supervised data, as
by the following composition of hybrid-type transfer operators:
Here,
where is the set of trainable parameters. Each transfer operator
(Equation (1)) describes the corresponding layer of the neural network under consideration. The additional index q, where q belongs to the finite index set Q, defines a specific selection of the network architecture from a given hybrid collection. The family of given network architectures in the hybrid setting (1) is formalized by the finite index set Q. This set allows for a consistent formulation of the optimal structuring problem for a hybrid neural network. Note that the composition of the successive input-to-output transfer operators in (1) constitutes the dynamic system.
Recall that in practical applications, the input collection of a fully connected deep feedforward neural network (1) usually consists of some images, time-series, or other suitable input data. We consider the feedforward network architecture in any hybrid setting. The hybrid nature of this network is assumed to be given by the different activation functions
associated with the layers.
Note that in a conventional MLP framework, the trainable network parameter in (1) constitutes a pair:
where
We refer to [8] for the necessary technical details and practical implementations of conventional MLPs. We now introduce the following set of all admissible parameters in (1):
The definition in (2) of the successive network parameters in a neural network implies a formal characterization of the transfer operators in (1):
We next implement this abstract operator-based description (1) of the dynamics and consider the family
of the state transformation functions
In the layered network framework mentioned above, these state transformation functions are defined as follows:
for and . By
we next denote in (4) a family of admissible activation functions for the hybrid neural network. For every , the activation function acts component-wise on the corresponding -dimensional vector-argument:
Therefore, an activation function can formally be determined by a scalar mapping for every . Recall that various choices of an activation function are available in the literature (see, e.g., [6,7]). Clearly, the hybrid deep neural network states in (4) imply the corresponding hybrid state collection:
For the given network, the basic relation (4) implies a trainable state transformation of the following type:
where are the given network inputs. The dynamics of a fully connected feedforward hybrid neural network are given by (5). This relation describes the interplay of layers in a hybrid MLP model and can be naturally interpreted as a discrete-time controlled dynamic system. For abbreviation, we refer to this hybrid deep neural network system as the HDNNS.
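For orientation, a standard instance of such a trainable state transformation, written here as a sketch consistent with the fully connected setting (and not necessarily the exact form of relation (5)), reads

\[
x^{s}_{t+1} \;=\; \sigma_{q_t}\bigl( W_t\, x^{s}_{t} + b_t \bigr),
\qquad t = 0, 1, \ldots, T-1, \quad s = 1, \ldots, S,
\]

where \(W_t\) and \(b_t\) are the trainable weights and bias of layer \(t\), \(\sigma_{q_t}\) is the activation selected by the hybrid index \(q_t \in Q\), and \(x^{s}_{0}\) is the \(s\)-th network input.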
We also note that the resulting discrete-time system (5) constitutes a specific example of a dynamics with the changing state dimensionality. We refer to [10] for some optimization approaches to systems with variable state dimensionality.
As mentioned above, the problem of training a fully connected feedforward neural network can now be considered as a specific nonlinear optimization problem with an additional structural optimization. For conventional (non-hybrid, where all layers have the same computational properties) multilayer neural network learning models, the set of optimization variables consists of the network parameters , where . We refer to [6,7,8] for further concepts and formal details. In contrast to conventional learning, we now need to enlarge the set of optimization variables for the hybrid network learning models and include the additional discrete index variable in the optimization framework. Let
and
We have the parameter admissibility condition . Let
Now, we consider an objective functional
associated with the optimal training design of the hybrid MLP neural network. The main training problem, namely, the optimal structuring problem for a hybrid MLP network (see Introduction) can now be written in the form of the following OCP:
The network parameter and the hybrid indices, namely, the pair
can now be interpreted as a “control input” of the dynamic system (5). We use the control theoretic notation that is naturally motivated by the resulting OCP (6). The control input in problem (6) expresses the trainable design of a feedforward neural network. This design also includes the optimal structuring of the given layers. The goal of the resulting network training is to determine optimal parameters and the network structure (the hybrid indices ) such that the objective function is minimized. In the case of learning models, the objective functional usually includes differences between the final network output and some known targets, called “training labels”.
We now introduce an additional necessary notation. By , we next denote the final network output, namely, a solution of the HDNNS (5) for a concrete control pair and for an input of . Generally, the objective functional in (6) can be defined as follows:
Here,
is a sufficiently smooth function. Let be the vectors of training labels. Here, is the number of labels. In the framework of an MLP network optimization, one usually considers the following function :
where is a so-called classifier applied component-wise to a vector, , and is the Euclidean norm. Recall that . Let us observe that the generic DL optimization models usually involve an average objective functional of the type (7).
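As an illustration of the average objective functional of type (7)–(8) (an assumed typical instance, not necessarily the authors' exact expression), one may write

\[
J(u, q) \;=\; \frac{1}{S}\sum_{s=1}^{S} \bigl\| \phi\bigl(x^{s}_{T}\bigr) - y^{s} \bigr\|^{2},
\]

where \(\phi\) is the classifier, \(y^{s}\) are the training labels, and \(x^{s}_{T}\) is the final network output generated from the \(s\)-th input.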
The main optimization problem (6) associated with a hybrid MLP neural network can be characterized as a discrete-time OCP with a switched structure. We next assume that this main OCP (6) possesses an optimal solution (optimal control):
The obtained main OCP (6) involves a so-called “terminal objective functional” (also called the Mayer functional). The control input for this system is an admissible pair . An optimal control design formalized by problem (6) implies an adequate training procedure of the given MLP neural network with a hybrid structure. The process of finding the optimal parameters in optimization problem (6) can also be interpreted as a generalized (dynamic) regression problem. We refer to [11] for the necessary mathematical details and some important control theoretical results related to a class of dynamic optimization problems of the type (6).
The practically oriented DL models often include a regularized version of the initially given objective functional . In such a case, the objective functional in (6) is replaced by the corresponding regularization:
where is a suitable regularization function. For the generic regularization framework (9), one can consider the celebrated Tikhonov–Phillips concept [7]. In modern ML, this regularization method is also known as the weight decay approach (see, e.g., [6]). Note that, depending on the concrete application, many different choices of a suitable regularization function are possible. We refer to [12] for a detailed discussion of the proximal-point regularization methodology. Let us also note that in the case of a regularized functional, the resulting OCP constitutes a Bolza-type problem.
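A minimal weight-decay (Tikhonov-type) instance of the regularized functional in (9), given here only as an assumed example, is

\[
J_{\mathrm{reg}}(u, q) \;=\; J(u, q) \;+\; \lambda \sum_{t=0}^{T-1} \bigl( \| W_t \|_{F}^{2} + \| b_t \|^{2} \bigr),
\qquad \lambda > 0,
\]

where \(\lambda\) is the regularization parameter and \(\| \cdot \|_{F}\) denotes the Frobenius norm.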
3. Application of a Switched System Methodology to the Hybrid Deep Learning Model
The HDNNS (5) introduced in the previous section describes a learning process of the hybrid MLP network. It exhibits both discrete-time and combinatorial dynamic behaviour. The combinatorial structure of system (5) and that of the main OCP (6) is given by the hybrid indices with . In order to develop a constructive analytic description of this combinatorial part of the system, we next consider (5) in the context of switched control systems [11,13,14].
Recall that the switched system methodology has been established as a powerful modelling and solution approach to a wide class of complex real-world dynamic systems. We refer to [11] for the necessary theoretical concepts and practical engineering applications of switched systems.
We now use the generic definition of a switched control system from [11] and propose a novel formal description adapted to the HDNNS abstraction (5).
Definition 1.
A switched dynamic system associated with the HDNNS model (5) is a collection of , where the following apply:
, is a sequence of switching times associated with an implemented sequence of layers in the HDNNS such that
is a reset set for the network layers.
Note that in the conventional hybrid/switched systems and control theory, the index set Q in Definition 1 is sometimes called a “set of locations”. A “location” of the system under consideration is specified by a concrete variable .
The switched dynamic system from Definition 1 is assumed to be determined on an interval . A concrete sequence l of switching times
defines a partition of the interval into adjacent subintervals associated with every hybrid layer index
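For illustration (an assumed toy instance, not taken from the paper): with T = 5 layers, Q = {1, 2, 3}, and switching times 2 and 4, the partition consists of the subintervals {0, 1}, {2, 3}, and {4}, so that layers 0–1 run the computational model q = 1, layers 2–3 run q = 2, and layer 4 runs q = 3.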
We now use the results of [11,12,14] and rewrite the complete dynamics of a switched system from Definition 1 in the following compact form:
Here, is the characteristic function of the interval .
The concept of a switched dynamic system from Definition 1 applied to the HDNNS model (5) leads to a constructive (non-combinatorial) representation of the control variable in system (5) and in the main OCP (6). The following vector-function
makes possible a function-based representation of the combinatorial control variable . We next define the set of all admissible characteristic functions for the discrete-time switched system (10):
where . From Definition 1, it is easy to see that this set is in one-to-one correspondence with the set of all admissible sequences l of switching times. These switching times correspond to the possible changes in the hybrid MLP network structure according to the main optimal structuring problem that we consider.
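The following fragment (an illustrative sketch under the above conventions; all names are hypothetical) shows how the combinatorial choice of a layer model can be encoded by a one-hot characteristic vector and how the switched dynamics reproduce exactly one candidate layer per step.

```python
import numpy as np

def switched_step(x, u, beta_t, models):
    """One step of the switched dynamics: x_{t+1} = sum_q beta_t[q] * f_q(x, u).

    models : list of candidate layer maps f_q(x, u), one per location q in Q
    beta_t : one-hot vector (the characteristic function evaluated at step t)
    """
    assert np.isin(beta_t, [0, 1]).all() and beta_t.sum() == 1  # admissibility
    q = int(np.argmax(beta_t))          # exactly one active location per step
    return models[q](x, u)
```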
As mentioned above, the developed switched system approach to the HDNNS makes it possible to replace the control variable with an equivalent expression of in system (5). Let be a solution of the switched dynamic system (10) for a concrete selection of the control input
Using the switched system’s formalism, we now rewrite the main OCP (6) in the following form:
The objective functional in Equation (11) can be rewritten as
Evidently, the objective in (12) is a combined functional.
For the given hybrid MLP network model, the state represents the final output of the neural network learning process. In accordance with the originally given OCP (6), we next assume that the switched-type problem (11) possesses an optimal solution
The corresponding optimal solution of the switched dynamic system (10) is next defined by .
The switched-type problem (11) can be considered as a “constructive” version of the initially presented OCP stated in (6). It contains a numerically tractable formalization of the combinatorial control component in (11). Similarly to the case of the original OCP regularization (9), we also introduce a regularized version of the objective functional of the problem in (11):
The selection of a concrete regularization functional in (13) can be made in the same way as in problem (9).
4. A Critical Analysis of the Necessity of Optimality Conditions of Pontryagin Type
This section is devoted to the necessary optimality conditions for the switched-type OCP in Equation (11). These optimality conditions are given in the form of a Pontryagin Maximum Principle (PMP) for general optimal control processes governed by switched dynamics. We refer to [15,16] for the necessary technical details, examples, and analytical results. An applicability analysis of the PMP for the concrete class of OCPs (11) involving the switched control systems (10) shows the inconsistency of a possible application of this PMP. As will be shown in this section, the switched systems approach to hybrid feedforward network learning reveals the ineffectiveness of the celebrated PMP for problem (6). Since OCP (11) constitutes a mathematically equivalent representation of the hybrid MLP neural network learning problem (6), the same inconsistency conclusion also holds for the initially given OCP (6). This important fact provides the main motivation for the further development of alternative solution schemes for the network learning problem under consideration.
Recall that the celebrated PMP expresses the necessary optimality conditions in many types of the conventional, hybrid, and switched-type OCPs. Consider now the switched-type OCP (6) and introduce the corresponding Hamiltonian function:
By , we denote in (14) the generic adjoint variable, and is a scalar product in a Euclidean space Z. For every location
of the switched control system (10), we can define the corresponding partial Hamiltonian:
Evidently, the full Hamiltonian (14) can be expressed through these partial Hamiltonians.
Recall that the classical PMP for a discrete-time OCP expresses the necessary optimality condition in terms of a Hamiltonian function. We now present an advanced version of the optimality conditions, namely, the PMP for the concrete discrete-time switched OCP (11).
Theorem 1.
Consider the discrete-time switched OCP (11) and assume that it has an optimal solution
and is Lagrange-regular. Assume that the right-hand side in the HDNNS (5) is given by (4), and the concrete function in (12) is defined by (8). Let the classifier be a continuously differentiable function. Assume further that for each and , the orientor field
is convex. Then, there exists an adjoint process
such that the following relations are satisfied for and :
We refer to [11,17], and some references therein, for the formulation and proof of the PMP for OCP (11). The classic PMP for the conventional discrete-time OCPs can be found in [16]. Let us note that the derivative
in (16) can be explicitly calculated in the framework of an MLP network. One uses the concrete expression (8) of function for this purpose. Moreover, in the case of a regularized functional determined by (13), the corresponding Hamiltonian of the resulting regularized OCP can be defined as follows:
Evidently, in this case, the regularization functional only appears in the last relation (inequality) of the generic system (16) of optimality conditions.
Versions of the PMP related to concrete OCPs provide a theoretical foundation for an important class of numerical methods in optimal control. This class comprises the so-called indirect computational methods for OCPs. We refer to [11,12] for an overview. However, the PMP mentioned above includes a specific assumption, namely, the convexity condition on the orientor field. This convexity assumption constitutes the most crucial condition of Theorem 1. A possible violation of this assumption would imply the incorrectness of the PMP optimality conditions for the specific OCP (11). The same is also true with respect to the initially given hybrid MLP neural network learning problem (6). Note that in the case of the regularized OCP with the cost functional and with the correspondingly regularized Hamiltonian, the conditions of Theorem 1 need to be extended by an additional convexity assumption on the regularization function (see, e.g., [11]).
It is necessary to stress that the convexity assumption for the orientor field has no relation to the convexity of the right-hand side of the switched dynamic system (10). We refer to [11,15,16] for various examples of specific convex and non-convex orientor fields.
In the case of MLP trainable layers, the functions
have a quasi-affine structure with respect to
(see (4)). This special case includes many important types of layers, for example, fully connected layers, convolution layers, and batch normalization layers (see, e.g., [6,7]). One can assume that the parameter set is convex. However, the non-convexity of the set of admissible characteristic functions implies the non-convexity of the complete (product) control set even in the case of a convex parameter set. The above property, the nonlinearity of the activation function, and the specific bilinear (non-convex) structure of the summands in the right-hand side of (10) exclude the necessary convexity property of the orientor field required in Theorem 1. When the orientor field is a non-convex set, it is, in general, not true that the PMP from Theorem 1 constitutes a necessary optimality condition for the switched-type OCP (11).
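A minimal illustration of this effect (a toy example constructed for exposition, not taken from the paper): take Q = {1, 2} with the scalar branches f_1(x, u) = u and f_2(x, u) = u + 2 and the convex control set u ∈ [0, 1]. Then

\[
\{\, f_q(x, u) : q \in Q,\ u \in [0, 1] \,\} \;=\; [0, 1] \cup [2, 3],
\]

which is non-convex although each branch is affine and the control set is convex; an analogous union structure appears when switching between different activation functions.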
We now discuss an additional counter-argument for a possible consideration of the PMP (Theorem 1) for the numerical treatment of the switched OCP (11). It is well known that the necessary optimality condition in the form of a PMP provides a numerically consistent algorithm in a full space (for example, in a Lebesgue space) of admissible control functions. On the other hand, a direct application of the corresponding PMP to problem (11) does not guarantee admissibility of the numerical optimal solution in the sense of problem (11), i.e., it does not guarantee the required condition:
The above counter-argument shows that the celebrated PMP and the corresponding computational solution procedures cannot be applied directly to the specific problem (11). The “admissibility problem” mentioned above is a direct consequence of the following pivotal theoretical observation related to a formal mathematical proof of the PMP: the set of possible needle variations associated with the characteristic functions in problem (11) constitutes a very “poor” set of variations (see, e.g., [15]). As a consequence, one cannot derive a generic adjoint equation that guarantees that the numerically optimal value belongs to the set of admissible switched controls. A possible numerical application of the optimality system (16), including the adjoint equation, namely, the difference equation for the variable in (16), generally yields an inadmissible numerically optimal variable. Since the set of admissible characteristic functions is non-convex, we cannot effectively apply any suitable projection approach to it.
The two main critical arguments discussed above reflect the conceptual applicability problem of the generic PMP in the context of numerical treatments of the OCP (11). This fact significantly restricts the possible numerical application of the presented Theorem 1 and the corresponding indirect solution algorithms. This situation provides a main motivation for the development of novel, direct-solution techniques for OCPs associated with the optimal learning of hybrid fully connected neural networks.
5. A Reduced-Gradient Approach to the Switched-Type OCP for Hybrid Neural Networks
This section presents a novel solution scheme for the switched-type OCP (11) related to the optimal structuring problem for a hybrid MLP neural network. As discussed in Section 4, a possible application of the indirect (PMP-based) numerical methods to the switched-type OCP (11) involves some conceptual difficulties. On the other hand, the specific structure of the optimization problem under consideration makes it possible to derive a constructive expression for the so-called “reduced gradient” of the objective functional in OCP (11). The explicit characterization of the reduced gradient discussed in the next theorem will be used as an analytic basis for the specific first-order solution algorithm that we propose. This algorithm can be applied directly to the original switched OCP (11) as well as to the regularized version of the problem.
Consider OCP (11) associated with the given hybrid neural network of the MLP type with the continuously differentiable functions from (8). Since the concrete functions in (3) are also continuously differentiable, we derive the Fréchet differentiability of the objective functional in OCP (11). The corresponding gradient of is next denoted by .
We now apply the generalized reduced-gradient method that is comprehensively discussed in [11,12] to the concrete OCP (11) and obtain the following formal results.
Theorem 2.
Consider the switched OCP (11) and assume that all conditions of Theorem 1 are satisfied. The reduced gradient of the objective functional in problem (11) can be computed as a solution of the following system of equations:
where is an adjoint variable.
By and , we denote here the partial derivatives of H with respect to and , respectively. A complete proof of Theorem 2 in the case of a fixed can be found in [11,12]. Some similar results for concrete classes of discrete-time and continuous-time OCPs are obtained in [17].
Using expression (14) for the system Hamiltonian , we can calculate the elements and of the gradient vector in (17):
and
By and , we denote here the transpose operation. Let us note that a detailed calculation of the partial derivative
depends on a concrete selection of an activation function in (4).
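To indicate how such gradient expressions are evaluated in practice, the following sketch computes the reduced gradient of a terminal cost for a toy tanh network by one forward and one backward (adjoint) sweep. It is an illustrative reading of the adjoint recursion in (17) under our own simplifying assumptions (terminal quadratic cost, tanh activation), with all names chosen by us; it is not the authors' code.

```python
import numpy as np

def reduced_gradient(x0, y, weights, biases):
    """Adjoint-based gradient of J = 0.5*||x_T - y||^2 for x_{t+1} = tanh(W_t x_t + b_t)."""
    # Forward sweep: store states and pre-activations.
    xs, zs = [x0], []
    for W, b in zip(weights, biases):
        zs.append(W @ xs[-1] + b)
        xs.append(np.tanh(zs[-1]))
    # Terminal adjoint: p_T = dJ/dx_T.
    p = xs[-1] - y
    grads_W, grads_b = [], []
    # Backward sweep: delta_t = sigma'(z_t) * p_{t+1}; p_t = W_t^T delta_t.
    for W, z, x in reversed(list(zip(weights, zs, xs[:-1]))):
        delta = (1.0 - np.tanh(z) ** 2) * p
        grads_W.append(np.outer(delta, x))   # dJ/dW_t
        grads_b.append(delta)                # dJ/db_t
        p = W.T @ delta
    return list(reversed(grads_W)), list(reversed(grads_b))
```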
Consider now the regularized functional from (13) and the corresponding regularized OCP (11). We assume that the regularizing functions
are continuously differentiable. In this case, the Hamiltonian introduced in (14) needs to be replaced by its regularized version that was introduced in the previous section:
Using this replacement, the reduced gradient in the regularized OCP can be found from the corresponding extension of the basic system of Equation (17) from Theorem 2. Let us formulate the corresponding result for a regularized OCP (11).
Theorem 3.
Consider a regularized version of the switched OCP (11) and assume that all conditions of Theorem 1 are satisfied. The reduced gradient of the objective functional can be computed as a solution of the following system of equations:
where denotes the zero vector in the Euclidean space .
The results of Theorems 2 and 3, namely, the constructive expressions for the reduced gradients in the originally given OCP (11) and in the regularized problem, make it possible to consider various first-order gradient-based computational techniques for a numerical treatment of the switched OCP (11). Let us present here the generic reduced-gradient algorithm for problem (11):
Here, is the iteration index. Clearly, for the objective functional from the regularized version of problem (11), the gradient in (19) needs to be replaced by the constructive expression of
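Schematically, one iteration of the generic reduced-gradient method (19) can be sketched as a projected gradient step on the control pair. In the fragment below, grad_J and project_admissible are placeholders for the problem-specific routines (the reduced gradient of Theorems 2 and 3 and the projection discussed at the end of this section); the fragment is an assumption-laden sketch rather than the exact algorithm.

```python
def reduced_gradient_step(u, beta, grad_J, project_admissible, step_size=1e-2):
    """One iteration of a projected reduced-gradient scheme (illustrative)."""
    g_u, g_beta = grad_J(u, beta)                  # reduced gradient (Theorems 2/3)
    u_next = u - step_size * g_u                   # descent in the parameter component
    beta_next = beta - step_size * g_beta          # descent in the switching component
    return project_admissible(u_next, beta_next)   # restore admissibility
```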
We now study the numerical stability of the basic gradient method (19). Recall that the convergence properties of the generic reduced-gradient method for switched system optimization were comprehensively discussed by many authors (see, e.g., [11,17,18] and references therein). Concretely, we can prove the following special convergence result for the reduced-gradient algorithm (19).
Theorem 4.
Consider OCP (11) and assume that all conditions of Theorem 1 are satisfied. Let
be a sequence generated by method (19). Then, for an admissible initial point
the above sequence (20) is a minimizing sequence for problem (11), i.e.,
Moreover, this sequence converges weakly to a solution of the switched-type OCP (11).
Proof.
The property of sequence (20) to be a minimizing sequence for problem (11) immediately follows from [9]. Moreover, the concrete functions
where F is defined in (3), are Lipschitz continuous and possess Lipschitz continuous derivatives for all . Since the composition of two Lipschitz continuous mappings inherits the same property, we obtain the Lipschitz continuity of the gradient . The weak convergence of sequence (20) generated by method (19) to the optimal pair then follows from the Lipschitz continuity of the gradient and the general convergence theory for gradient-type methods in the corresponding Hilbert space. This completes the proof.
Let us make some observations related to Theorem 4. Since and is a subset of a finite-dimensional Euclidean space (see Section 2), the weak convergence of the first component in (20) to coincides with the norm convergence in this Euclidean space. The weak convergence of the second component in (20) to is in fact a weak convergence in the Hilbert space of all square integrable functions of the corresponding dimensionality.
Note that the inclusion condition
for the updated iterations in (19) can be implemented using a projection. Assume that set is convex. In that case, one needs to consider the following convexification of the product set :
where is a convexification of the function set . We refer to [11,20] for further theoretical and computational details related to the relaxation (convexification) procedures for gradient methods.
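One simple way to realize such a convexification numerically (a sketch under our own assumptions, not prescribed by the paper) is to relax every one-hot vector of the characteristic function to the probability simplex and to project back onto the simplex after each gradient step:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                          # sort in descending order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]    # last index with positive gap
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Example: relaxed switching weights for one layer with |Q| = 3 candidate models.
beta_relaxed = project_simplex(np.array([0.7, -0.1, 0.6]))  # nonnegative, sums to 1
```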
Finally, note that the proposed reduced-gradient algorithm provides an adequate alternative to the PMP-based optimality conditions critically discussed in Section 4. It constitutes a novel, OCP-based approach to the optimal structuring problem for a hybrid MLP neural network.
6. Concluding Remarks
In this paper, we developed a novel, optimal control-based approach to the optimization problem associated with the training of a feedforward, fully connected hybrid neural network. We used the control-theoretic approach and reformulated the main optimal structuring problem for a hybrid MLP neural network in the form of a specific switched-type OCP. This equivalent reformulation makes it possible to apply the advanced optimal control methodology to the training and structural optimization of a given hybrid MLP network. To be precise, we considered the optimal control of discrete-time switched systems and developed a reduced-gradient method for the practical treatment of the resulting switched OCP.
Let us note that the proposed equivalent representation of the trainable network design has the potential to be used in the optimal training of a convolutional neural network (CNN). This optimistic conclusion follows from the general structure of the CNN (see, e.g., [21,22]) and from the existence of many well-developed theoretical results and computational methods for optimal control.
The equivalent reduction of a given sophisticated fully connected neural network design problem to a semiclassical OCP demonstrates the effectiveness of the proposed optimal control methodology for neural network learning.
This is a preliminary proposal to optimize the connection weights of sequential layers of computing nodes to accomplish a mapping from a domain to a range, which is the task performed by a neural network. This model is capable of accommodating different activation functions at different layers, which prompted us to call it a hybrid (layer) neural network. This work constitutes initial theoretical research on the subject. The advanced optimal control approach discussed here is a complementary technique in the context of existing neural network training algorithms. To establish our approach as a viable alternative to established learning algorithms, we need to prototype and simulate the model. Comprehensive simulation studies for performance analysis and comparison with the traditional method of learning, both in the context of computational efficiency and quality of results, are needed. This is our future work.
Author Contributions
Conceptualization, G.C. and V.A.; Methodology, G.C.; Validation, L.A.G.T.; Formal analysis, V.A.; Writing—original draft, G.C.; Writing—review & editing, V.A., L.A.G.T. and G.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
The set of inputs of the supervised data; 0 is the label for the input
The dimension of the input data
The set of outputs of the supervised data; T is the label for the output
The dimension of the output data
T: the number of layers; the integer T also serves as the label of the output layer and of the output data
S: the number of training data; s is used as the index of the training data
The input data
The output data
Index for the network layer number
Number of input nodes (dimension of the input) to a layer
Number of output nodes (dimension of the output) of a layer
The trainable parameters of a network layer
The connection weights between the input and output nodes of a layer
The connection weights of the bias node of a layer
Q: a finite index set for a family of predefined computational models; this allows the network architecture to accommodate hybrid layers
A specific option from Q; different layers may share the same computational model
The transfer function of a layer using the computational model q
The combined transfer function of two consecutive network layers; a layer’s output is the input to the next layer
The set of admissible parameters of a layer
The layer transfer operator
The transfer function from the input to the output vector
The activation function of type q
The variables for and , respectively
The objective function for regularization expressed in Equation (6)
References
Rumelhart, D.E.; McClelland, J.L. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1987.
Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257.
Zhong, S.; Cherkassky, V. Factors controlling generalization ability of MLP networks. In Proceedings of the IJCNN’99 (International Joint Conference on Neural Networks), Washington, DC, USA, 10–16 July 1999.
Chakraborty, G.; Murakami, M.; Shiratori, N.; Noguchi, S. A growing network that optimizes between undertraining and overtraining. In Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 2, pp. 1116–1121.
Benning, M.; Celledoni, E.; Ehrhardt, M.J.; Owren, B.; Schoenlieb, C.-B. Deep learning as optimal control problems: Models and numerical methods. J. Comput. Dyn. 2019, 6, 171–198.
Li, Q.; Hao, S. An optimal control approach to deep learning and applications to discrete-weight neural networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
Galván-Guerra, R.; Azhmyakov, V.; Egerstedt, M. Optimization of multiagent systems with increasing state dimensions: Hybrid LQ approach. In Proceedings of the 2011 American Control Conference, San Francisco, CA, USA, 29 June–1 July 2011; pp. 881–887.
Azhmyakov, V. A Relaxation Based Approach to Optimal Control of Switched Systems; Elsevier: Oxford, UK, 2019.
Azhmyakov, V.; Basin, M.; Raisch, J. A proximal point based approach to optimal control of affine switched systems. Discret. Event Dyn. Syst. 2012, 22, 61–81.
Atlee Jackson, E. On the control of complex dynamic systems. Physica D 1991, 50, 341–366.
Egerstedt, M.; Wardi, Y.; Axelsson, H. Transition-time optimization for switched systems. IEEE Trans. Autom. Control 2006, 51, 110–115.
Boltyanski, V.; Poznyak, A. The Robust Maximum Principle; Birkhauser: New York, NY, USA, 2012.
Halkin, H. A maximum principle of the Pontryagin type for systems described by nonlinear difference equations. SIAM J. Control 1966, 4, 90–111.
Teo, K.L.; Goh, C.J.; Wong, K.H. A Unified Computational Approach to Optimal Control Problems; Wiley: New York, NY, USA, 1991.
Polak, E. Optimization; Springer: New York, NY, USA, 1997.
Roubicek, T. Relaxation in Optimization Theory and Variational Calculus; De Gruyter: Berlin, Germany, 1997.
Saleh, A.M.; Hamoud, T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J. Big Data 2021, 8, 1.
Wang, J.; Lin, J.; Wang, Z. Efficient convolution architectures for convolutional neural network. In Proceedings of the 2016 8th International Conference on Wireless Communications and Signal Processing (WCSP), Yangzhou, China, 13–15 October 2016; pp. 1–5.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.