Article

A Formal Approach to Optimally Configure a Fully Connected Multilayer Hybrid Neural Network

by Goutam Chakraborty 1,2,*,†, Vadim Azhmyakov 3,*,† and Luz Adriana Guzman Trujillo 4,†

1 Department of Software & Information Science, Iwate Prefectural University, Takizawa, Iwate 020-0693, Japan
2 Department of Computer Science & Engineering, Madanapalle Institute of Technology & Science, Madanapalle 517325, A.P., India
3 Department of Mathematics, Madanapalle Institute of Technology & Science, Madanapalle 517325, A.P., India
4 LARIS, University of Angers, 49000 Angers, France
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.

Mathematics 2025, 13(1), 129; https://doi.org/10.3390/math13010129
Submission received: 11 November 2024 / Revised: 16 December 2024 / Accepted: 25 December 2024 / Published: 31 December 2024

Abstract
This paper is devoted to a novel formal analysis of learning models for feedforward multilayer neural networks with hybrid structures. The proposed mathematical description takes the form of a specific switched-type optimal control problem (OCP). We develop an equivalent, optimal control-based formulation of the problem of training a hybrid feedforward multilayer neural network, i.e., of learning the target mapping function constrained by the training samples. This novel formal approach makes it possible to apply several well-established optimal control techniques to the design of a versatile class of fully connected neural networks. We next discuss why Pontryagin-type necessary optimality conditions are not well suited to a constructive treatment of the obtained switched-type OCP. This fact motivates us to consider so-called direct-solution approaches to the switched OCPs that can be associated with the learning of hybrid neural networks. Concretely, we consider a generalized reduced-gradient algorithm in the framework of the auxiliary switched OCP.

1. Introduction

Although it has been almost forty years since the multilayer perceptron (MLP) was proposed [1], it is still widely used as a fully connected classifier, often alongside deep convolutional neural networks. MLPs are fully connected feedforward neural networks, and it has been proven that they are universal approximators [2,3]. The importance of MLPs can hardly be overstated: they remain key classifier models owing to the expressive power guaranteed by the universal approximation theorem [3]. An important issue, however, is the generalization ability of a well-trained network [4]. During the 1990s, balancing the bias and variance of the network, which is achieved by an appropriate choice of the number of layers and the number of nodes, was an important research topic (see, e.g., [5] and references therein). This issue was mitigated by regularizing the connection weights.
However, a formal method for determining an optimal configuration of the network architecture, i.e., the number of hidden layers and their constituent numbers of units, was never properly addressed. A generic MLP consists of computational layers $C_1, C_2, C_3, \ldots, C_t, \ldots, C_T$, with the same computational model at every layer but with different numbers of hidden units. The connection weights between layers are the parameters of the model. A hybrid MLP with trainable parameters involves layers with different activation functions. Except for the general convention of using a larger number of computing nodes for layers closer to the input, there is no hypothesis on how the layers with different activation functions should be organized in order to obtain an optimal overall performance. Of course, various optimization criteria can be defined; two basic ones are faster convergence and better generalization of the mapping function.
For our analysis, considering the layers $C_1, C_2, C_3, \ldots, C_t, \ldots, C_T$, we assume that the number of layers $T$ is fixed and that every layer has an equal number of units, also called nodes. A node can be disabled (made ineffective) by setting all its connecting weights to zero. We can drop a layer by making it an identity function, for which the output equals the input. If we start with a sufficiently large $T$, we can always converge to the optimal neural network using our algorithm. We always start with the generic model and update the network parameters to find an optimal configuration of the hybrid feedforward network for the task.
For a concrete solution to the problem of optimizing the neural network configuration, i.e., the network architecture, we will use a specific optimal control methodology that was initially developed for the optimal design of switched control systems. A conventional MLP neural network can be interpreted as a discrete-time dynamic system [6,7] (and references therein). Despite the impressive results seen in real-world applications, some basic formal aspects of the analysis of modern feedforward deep network models remain unsolved. This fact motivates the development of equivalent interpretations of generic neural network dynamics in the form of known mathematical abstractions. The equivalent representation of the hybrid MLP neural network as a discrete-time control system makes it possible to apply the switched optimal control methodology to the main optimal structuring problem mentioned above.
Generally, the optimal control approach is not commonly applied to the optimization of feedforward deep neural networks or their variants as in [8]. In this paper, we reduce the optimal configuration problem for a hybrid multilayer neural network model to a specific switched-type OCP. We then discuss some critical aspects of the Pontryagin optimality principle in the framework of the resulting OCP and develop a reduced-gradient method for this problem.
The novelty of the proposed model is its hybrid computational architecture. In the proposed architecture, all layers are fully connected, but the activation functions may differ from layer to layer. They are chosen automatically from a set of $Q^T$ possibilities so as to optimize the objective function defined in Equation (6); here, $T$ is the number of layers, which is set initially and fixed. Refs. [6,7] proposed training neural networks using tools from optimal control. In those proposals, all layers use the same computational model, which greatly limits the flexibility of the model being optimized. In the hybrid network presented herein, although the number of layers is set to a fixed $T$, the computation $q$ at each layer automatically converges to the optimum, choosing the best option from the $Q$ possibilities. For example, the proposed model is able to converge to an optimal number of computational layers by forcing the transfer functions of redundant layers to identity functions. The model can achieve a better configuration due to its greater flexibility, facilitated by the combination of hybrid computations adopted at different layers. In addition, the objective function for optimization (Equation (6)) is designed to regularize the connection weights to ensure better generalization.
The remainder of this paper is organized as follows: Section 2 contains a formal mathematical model of a hybrid MLP neural network and includes its equivalent representation in the form of an OCP. In Section 3, we study the obtained MLP network dynamics in the context of general switched control systems and consider the corresponding switched-type OCP. The proposed approach involves a constructive description of the permutation of layers in the given hybrid MLP network. Section 4 includes a critical discussion of the applicability of the generic Pontryagin-type optimality conditions to the practical treatment of the resulting OCP. Motivated by this critical discussion, in Section 5, we develop an alternative approach to the switched OCP solution associated with the hybrid network optimization problem, i.e., the original optimal configuration problem for a hybrid MLP neural network. This approach involves a specific variant of the conventional reduced-gradient algorithm associated with the switched OCP. Section 6 summarizes the paper with indications of future extensions of the idea.

2. A Discrete-Time Optimal Control Framework for the Feedforward Hybrid Deep Learning Networks

In this section, we develop a mathematically rigorous OCP-based formalism for the optimal training problem associated with a hybrid fully connected neural network. By hybrid, we mean that the different layers can use different activation functions. This novel system-theoretic formalization provides an equivalent modelling approach for the main optimal configuration problem for a hybrid MLP neural network.
Recall that the training of a fully connected feedforward neural network is often considered in a nonlinear programming framework [7]. In this case, one solves a suitable constrained or unconstrained optimization problem [9]. In supervised learning, the key idea is to find a classifier by estimating the parameters of a known parametric function (model) constrained by a supervised dataset. The formal procedure of finding the optimal parameters in the above setting is usually considered as a generalized regression problem.
An artificial neural network consists of processing units (also called nodes or neurons) that can be described in the framework of system inputs and outputs. Therefore, a suitable formal description of a multilayered neural network naturally leads to an equivalent discrete-time dynamic system, where training imparts the dynamic behaviour and convergence to an optimal state is the goal. Consequently, the main nonlinear optimization problem of training a feedforward deep neural network can also be studied in the context of a specific constrained OCP, although the convergence mechanisms differ. Neural network training relies on gradient descent algorithms, where the rate of convergence is controlled by the learning rate; in the OCP setting, the convergence is controlled by changing the discrete step size and is typically faster.
Let $T \in \mathbb{N}$ be the total number of layers in the given hybrid neural network and $S \in \mathbb{N}$ be the maximal sample size of the dataset. Given a collection of hybrid network layers with the inputs

$$Y_0 := \{ y_{s,0} \in \mathbb{R}^{N_0},\ s = 1, \ldots, S \},$$

we obtain the generated network outputs, either real (for a regression problem) or integer (for a classification problem), according to the supervised data, as

$$Y_T := \{ y_{s,T} \in \mathbb{R}^{N_T},\ s = 1, \ldots, S \}$$

by the following composition of hybrid-type transfer operators:

$$Y_T = \bigl(\phi_T^{q_I} \circ \phi_{T-1}^{q_{I-1}} \circ \cdots \circ \phi_1^{q_1}\bigr)(Y_0, \theta_1). \qquad (1)$$

Here,

$$\phi_{\tau+1}^{q'} \circ \phi_{\tau}^{q} := \phi_{\tau+1}^{q'}\bigl(\phi_{\tau}^{q}(Y_{\tau-1}, \theta_\tau), \theta_{\tau+1}\bigr), \quad q, q' \in Q, \ \tau = 1, \ldots, T-1,$$

where $\theta_\tau$ is the set of trainable parameters of layer $\tau$. Each transfer operator

$$\phi_\tau^{q_i}(\cdot), \quad q_i \in Q, \ i = 1, \ldots, I \in \mathbb{N},$$

in Equation (1) describes the $\tau$-th layer of the neural network under consideration. The additional index $q_i \in Q$ for $i = 1, \ldots, I$, where $I < T$, defines a specific selection of the network architecture from a given hybrid collection. The family of given network architectures in the hybrid setting (1) is formalized by a finite index set $Q \subset \mathbb{N}$. This set makes a consistent formulation of the optimal structuring problem for a hybrid neural network possible. Note that the composition of the successive input-to-output transfer operators in (1) constitutes the dynamic system.
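To make the operator composition (1) concrete, the following short Python sketch evaluates such a stack of fully connected layers whose activation type may change from layer to layer. The activation set, the layer widths, and the (features-first) shape conventions used here are illustrative assumptions and are not prescribed by the formal model above.

```python
import numpy as np

# Minimal sketch of the composition (1): a stack of fully connected layers whose
# activation type q may change from layer to layer. The activation set Q and the
# shape conventions are illustrative assumptions.
Q_SET = {
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
    "identity": lambda z: z,
}

def transfer(Y, theta, q):
    """One transfer operator phi_tau^q: sigma_q(W y + b), applied to every sample (row of Y)."""
    W, b = theta                      # W: (N_out, N_in), b: (N_out,)
    return Q_SET[q](Y @ W.T + b)

def forward(Y0, thetas, qs):
    """Composition phi_T^{q_I} o ... o phi_1^{q_1} applied to the input collection Y0."""
    Y = Y0
    for theta, q in zip(thetas, qs):
        Y = transfer(Y, theta, q)
    return Y

# Example: S = 4 samples, layer widths 3 -> 5 -> 2, activations chosen per layer.
rng = np.random.default_rng(0)
thetas = [(rng.standard_normal((5, 3)), np.zeros(5)),
          (rng.standard_normal((2, 5)), np.zeros(2))]
Y_T = forward(rng.standard_normal((4, 3)), thetas, qs=["tanh", "identity"])
print(Y_T.shape)   # (4, 2)
```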
Recall that in practical applications, the input collection $Y_0$ of a fully connected deep feedforward neural network (1) usually consists of images, time series, or other suitable input data. We consider the feedforward network architecture in a general hybrid setting. The hybrid nature of this network is assumed to be given by the different activation functions

$$\sigma_\tau^{q_i}(\cdot), \quad \tau = 1, \ldots, T, \ q_i \in Q, \ i = 1, \ldots, I \in \mathbb{N},$$

associated with the layers.
Note that in a conventional MLP framework, the trainable network parameter $\theta_\tau$ in (1) constitutes a pair

$$\theta_\tau := (W_\tau, b_\tau), \quad \tau = 1, \ldots, T,$$

where

$$W_\tau \in \mathbb{R}^{N_{\tau-1} \times N_\tau}, \quad b_\tau \in \mathbb{R}^{N_\tau}.$$

We refer to [8] for the necessary technical details and practical implementations of conventional MLPs. We now introduce the following set of all admissible parameters $\theta_\tau$ in (1):

$$\Theta_\tau \subseteq \mathbb{R}^{N_{\tau-1} \times N_\tau} \times \mathbb{R}^{N_\tau}, \quad \tau = 1, \ldots, T. \qquad (2)$$

The definition in (2) of the successive network parameters $\theta_\tau \in \Theta_\tau$ in a neural network implies a formal characterization of the transfer operators in (1):

$$\phi_\tau^{q} : \mathbb{R}^{N_{\tau-1}} \times \Theta_\tau \to \mathbb{R}^{N_\tau}, \quad \tau = 1, \ldots, T.$$
We next implement this abstract operator-based description (1) of the dynamics and consider the family

$$\mathcal{F} := \{ f_\tau^{q}(\cdot,\cdot) \}_{\tau = 1, \ldots, T}^{q \in Q} \qquad (3)$$

of the state transformation functions

$$f_\tau^{q} : \mathbb{R}^{N_{\tau-1}} \times \Theta_\tau \to \mathbb{R}^{N_\tau}.$$

In the layered network framework mentioned above, these state transformation functions are defined as follows:

$$f_\tau^{q}(y_{s,\tau-1}, \theta_\tau) = \sigma_\tau^{q}\bigl(W_\tau y_{s,\tau-1} + b_\tau\bigr), \quad \sigma_\tau^{q}(\cdot) \in \chi, \ \theta_\tau \in \Theta_\tau, \ \tau = 1, \ldots, T, \qquad (4)$$

for $q \in Q$ and $s = 1, \ldots, S$. By

$$\chi := \{ \sigma_\tau^{q}(\cdot) \}_{\tau = 1, \ldots, T}^{q \in Q},$$

we denote in (4) a family of admissible activation functions $\sigma_\tau^{q}(\cdot)$ for the hybrid neural network. For every $q \in Q$, the activation function $\sigma_\tau^{q}(\cdot)$ acts component-wise on the corresponding $N_\tau$-dimensional vector argument

$$\bigl(W_\tau y_{s,\tau-1} + b_\tau\bigr).$$

Therefore, an activation function can formally be determined by a scalar mapping for every $q \in Q$. Recall that various choices of an activation function are available in the literature (see, e.g., [6,7]). Clearly, the hybrid deep neural network states $y_{s,\tau-1}$ in (4) imply the corresponding hybrid state collection

$$Y_{\tau-1} := \{ y_{s,\tau-1} \in \mathbb{R}^{N_{\tau-1}},\ s = 1, \ldots, S \}, \quad \tau = 1, \ldots, T,$$

in accordance with the operator description (1).
For the given network, the basic relation (4) implies a trainable state transformation of the following type:

$$y_{s,\tau} = f_\tau^{q_i}(y_{s,\tau-1}, \theta_\tau), \quad \tau = 1, \ldots, T, \ i = 1, \ldots, I, \ q_i \in Q, \qquad (5)$$

where $y_{s,0} \in Y_0$ are the given network inputs. The dynamics of a fully connected feedforward hybrid neural network are given by (5). This relation describes the interplay of layers in a hybrid MLP model and can naturally be interpreted as a discrete-time controlled dynamic system. For brevity, we refer to this hybrid deep neural network system as the HDNNS.
We also note that the resulting discrete-time system (5) constitutes a specific example of dynamics with changing state dimensionality. We refer to [10] for some optimization approaches to systems with variable state dimensionality.
As mentioned above, the problem of training a fully connected feedforward neural network can now be considered as a specific nonlinear optimization problem with an additional structural optimization. For conventional (non-hybrid, where all layers have the same computational properties) multilayer neural network learning models, the optimization variables consist of the network parameters $\theta_\tau \in \Theta_\tau$, where $\tau = 1, \ldots, T$. We refer to [6,7,8] for further concepts and formal details. In contrast to conventional learning, we now need to enlarge the set of optimization variables for the hybrid network learning models and include the additional discrete index variable $q \in Q$ in the optimization framework. Let

$$\vartheta := \{\theta_\tau\}_{\tau = 1, \ldots, T}, \qquad \Theta := \Theta_1 \times \cdots \times \Theta_T,$$

and

$$\rho := \{q_i\}_{i = 1, \ldots, I}.$$

We have the parameter admissibility condition $\vartheta \in \Theta$. Let

$$\rho \in Q^{I} = \underbrace{Q \times \cdots \times Q}_{I \ \text{times}}.$$

Now, we consider an objective functional

$$\tilde{J} : \Theta \times Q^{I} \to \mathbb{R}$$

associated with the optimal training design of the hybrid MLP neural network. The main training problem, namely, the optimal structuring problem for a hybrid MLP network (see the Introduction), can now be written in the form of the following OCP:

$$\begin{aligned}
&\text{minimize } \tilde{J}(\vartheta, \rho)\\
&\text{subject to } y_{s,\tau} = f_\tau^{q_i}(y_{s,\tau-1}, \theta_\tau), \quad y_{s,0} \in Y_0,\\
&\qquad\quad \theta_\tau \in \Theta_\tau, \ q_i \in Q, \ \tau = 1, \ldots, T, \ s = 1, \ldots, S, \ i = 1, \ldots, I. \qquad (6)
\end{aligned}$$
The network parameters and the hybrid indices, namely, the pair

$$(\vartheta, \rho) \in \Theta \times Q^{I},$$

can now be interpreted as the "control input" of the dynamic system (5). We use control-theoretic notation that is naturally motivated by the resulting OCP (6). The control input in problem (6) expresses the trainable design of the feedforward neural network. This design also includes the optimal structuring of the given layers. The goal of the resulting network training is to determine optimal parameters $\vartheta \in \Theta$ and the network structure (the hybrid indices $\rho \in Q^{I}$) such that the objective functional $\tilde{J}(\cdot,\cdot)$ is minimized. In the case of learning models, the objective functional usually includes differences between the final network output $y_{s,T} \in Y_T$ and some known targets, called "training labels".
We now introduce some additional necessary notation. By $y_{s,T}^{\vartheta,\rho} \in Y_T$, we denote the final network output, namely, a solution of the HDNNS (5) for a concrete control pair $(\vartheta, \rho) \in \Theta \times Q^{I}$ and for an input $y_{s,0} \in Y_0$. Generally, the objective functional $\tilde{J}(\vartheta, \rho)$ in (6) can be defined as follows:

$$\tilde{J}(\vartheta, \rho) := \frac{1}{S}\sum_{s=1}^{S} \Psi_s\bigl(y_{s,T}^{\vartheta,\rho}\bigr). \qquad (7)$$

Here,

$$\Psi_s : \mathbb{R}^{N_T} \to \mathbb{R}, \quad s = 1, \ldots, S,$$

is a sufficiently smooth function. Let $c_i \in \mathbb{R}^{N_T}$, $i = 1, \ldots, d$, be the vectors of training labels, where $d \in \mathbb{N}$ is the number of labels. In the framework of an MLP network optimization, one usually considers the following function $\Psi_s(\cdot)$:

$$\Psi_s\bigl(y_{s,T}^{\vartheta,\rho}\bigr) = \sum_{i=1}^{d} \bigl\| C\bigl(K\, y_{s,T}^{\vartheta,\rho} + b_T\bigr) - c_i \bigr\|_2^2, \qquad (8)$$

where $C : \mathbb{R} \to \mathbb{R}$ is a so-called classifier applied component-wise to a vector, $K \in \mathbb{R}^{N_T \times N_T}$, and $\|\cdot\|_2$ is the Euclidean norm. Recall that $b_T \in \mathbb{R}^{N_T}$. Let us observe that generic DL optimization models usually involve an average objective functional of the type (7).
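As a hedged numerical reading of (7) and (8), the sketch below evaluates the terminal objective; the logistic sigmoid used for the component-wise classifier $C$, the matrix $K$, and the label vectors $c_i$ are illustrative placeholders rather than choices fixed by the text.

```python
import numpy as np

# Sketch of the terminal objective (7)-(8). The component-wise classifier C is
# taken here to be a logistic sigmoid; K, b_T, and the label vectors c_i are illustrative.

def C(z):
    """A component-wise classifier C: R -> R (here, the logistic sigmoid)."""
    return 1.0 / (1.0 + np.exp(-z))

def psi(y_T, K, b_T, labels):
    """Psi_s from (8): sum over the d training labels of ||C(K y + b_T) - c_i||_2^2."""
    z = C(K @ y_T + b_T)
    return sum(np.sum((z - c) ** 2) for c in labels)

def objective(Y_T, K, b_T, labels):
    """J~(vartheta, rho) from (7): the average of Psi_s over the S final outputs y_{s,T}."""
    return sum(psi(y, K, b_T, labels) for y in Y_T) / len(Y_T)
```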
The main optimization problem (6) associated with a hybrid MLP neural network can be characterized as a discrete-time OCP with a switched structure. We next assume that this main OCP (6) possesses an optimal solution (optimal control)

$$(\vartheta^{opt}, \rho^{opt}) \in \Theta \times Q^{I}.$$

The obtained main OCP (6) involves a so-called "terminal objective functional" (also called a Mayer functional). The control input for this system is an admissible pair $(\vartheta, \rho)$. An optimal control design formalized by problem (6) implies an adequate training procedure for the given MLP neural network with a hybrid structure. The process of finding the optimal parameters $(\vartheta^{opt}, \rho^{opt})$ in optimization problem (6) can also be interpreted as a generalized (dynamic) regression problem. We refer to [11] for the necessary mathematical details and some important control-theoretic results related to the class of dynamic optimization problems of type (6).
Practically oriented DL models often include a regularized version of the initially given objective functional $\tilde{J}(\cdot,\cdot)$. In such a case, the objective functional $\tilde{J}(\cdot,\cdot)$ in (6) is replaced by the corresponding regularization

$$\tilde{J}_{reg}(\vartheta, \rho) := \tilde{J}(\vartheta, \rho) + \frac{1}{S}\sum_{s=1}^{S}\sum_{\tau=1}^{T} R_\tau(\theta_\tau), \qquad (9)$$

where $R_\tau : \Theta_\tau \to \mathbb{R}$ is a suitable regularization function. For the generic regularization framework (9), one can consider the celebrated Tikhonov–Phillips concept [7]. In modern ML, this regularization method is also known as the weight decay approach (see, e.g., [6]). Note that, depending on the concrete application, many different choices of a suitable regularization function $R_\tau(\cdot)$ are possible. We refer to [12] for a detailed discussion of the proximal-point regularization methodology. Let us also note that in the case of a regularized functional $\tilde{J}_{reg}(\vartheta, \rho)$, the resulting OCP constitutes a Bolza-type problem.
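As one concrete instance of (9), the sketch below uses the quadratic (Tikhonov/weight decay) choice $R_\tau(\theta_\tau) = \lambda(\|W_\tau\|_F^2 + \|b_\tau\|^2)$; the coefficient $\lambda$ and this particular form of $R_\tau$ are illustrative assumptions, not choices made in the text.

```python
import numpy as np

# Sketch of the regularized objective (9) with a quadratic weight-decay choice of
# R_tau; the coefficient lam is an illustrative hyperparameter.

def R(theta, lam=1e-4):
    W, b = theta
    return lam * (np.sum(W ** 2) + np.sum(b ** 2))

def objective_reg(Y_T, thetas, K, b_T, labels):
    """J~_reg = J~ + (1/S) * sum_s sum_tau R_tau(theta_tau); since R_tau does not
    depend on the sample s, the double sum reduces to sum_tau R_tau(theta_tau)."""
    return objective(Y_T, K, b_T, labels) + sum(R(th) for th in thetas)
```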

3. Application of a Switched System Methodology to the Hybrid Deep Learning Model

The HDNNS (5) introduced in the previous section describes the learning process of the hybrid MLP network. It exhibits both discrete-time and combinatorial dynamic behaviour. The combinatorial structure of system (5) and of the main OCP (6) is given by the hybrid indices $\rho \in Q^{I}$ with $q_i \in Q$. In order to develop a constructive analytic description of this combinatorial part of the system, we next consider (5) in the context of switched control systems [11,13,14].
Recall that the switched system methodology has been established as a powerful modelling and solution approach to a wide class of complex real-world dynamic systems. We refer to [11] for the necessary theoretical concepts and practical engineering applications of switched systems.
We now use the generic definition of a switched control system from [11] and propose a novel formal description adapted to the HDNNS abstraction (5).
Definition 1.
A switched dynamic system associated with the HDNNS model (5) is a collection $\{Q, \mathcal{F}, l, \Gamma\}$, where the following apply:
  • $Q \subset \mathbb{N}$ is a finite index set;
  • $\mathcal{F}$ is the family of vector fields introduced in (3);
  • $l = \{t_i\}$, $i = 1, \ldots, I$, is a sequence of switching times associated with an implemented sequence of layers in the HDNNS such that
    $$0 = t_0 < t_1 < \cdots < t_{I-1} < t_I \le T;$$
  • $\Gamma \subseteq \Xi := \{(q, q') : q, q' \in Q\}$ is a reset set for the network layers.
Note that in the conventional hybrid/switched systems and control theory, the index set $Q$ in Definition 1 is sometimes called a "set of locations". A "location" of the system under consideration is specified by a concrete variable $q_i \in Q$.
The switched dynamic system from Definition 1 is assumed to be defined on the interval $[0, T]$. A concrete sequence $l$ of switching times

$$t_i, \quad i = 1, \ldots, I,$$

defines a partition of the interval $[0, T]$ into the subintervals $[t_{i-1}, t_i)$ associated with every hybrid layer index

$$q_i \in Q, \quad i = 1, \ldots, I.$$

We now use the results of [11,12,14] and rewrite the complete dynamics of the switched system from Definition 1 in the following compact form:

$$y_{s,\tau} = \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y_{s,\tau-1}, \theta_\tau), \quad y_{s,0} \in Y_0, \ \theta_\tau \in \Theta_\tau, \ q_i \in Q, \ \tau = 1, \ldots, T, \ s = 1, \ldots, S, \ i = 1, \ldots, I. \qquad (10)$$

Here, $\beta_{[t_{i-1}, t_i)}(\cdot)$, $i = 1, \ldots, I$, is the characteristic function of the interval $[t_{i-1}, t_i)$.
The concept of a switched dynamic system from Definition 1 applied to the HDNNS model (5) leads to a constructive (non-combinatorial) representation of the control variable $\rho \in Q^{I}$ in system (5) and in the main OCP (6). The vector function

$$\beta(\cdot) := \bigl(\beta_{[t_{i-1}, t_i)}(\cdot)\bigr)_{i = 1, \ldots, I}$$

makes a function-based representation of the combinatorial control variable $\rho \in Q^{I}$ possible. We next define the set $\mathcal{B}$ of all admissible characteristic vector functions $\beta(\cdot)$ for the discrete-time switched system (10):

$$\mathcal{B} := \Bigl\{ \beta(\cdot) \ \Big| \ \beta(\tau) \in \{0,1\}^{I}, \ \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau) = 1 \Bigr\},$$

where $\tau = 1, \ldots, T$. From Definition 1, it is easy to see that the set $\mathcal{B}$ is in one-to-one correspondence with the set of all admissible sequences $l$ of switching times. These switching times correspond to the possible changes in the hybrid MLP network structure according to the main optimal structuring problem that we consider.
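To illustrate the set $\mathcal{B}$, the sketch below builds the matrix of values $\beta_{[t_{i-1},t_i)}(\tau)$ from a given sequence of switching times; the endpoint convention used for interval membership is an indexing choice made here for illustration only.

```python
import numpy as np

# Sketch of an admissible switching signal beta(.) in B: for every layer index tau,
# exactly one characteristic function beta_{[t_{i-1}, t_i)} is active, so beta(tau) is a
# one-hot vector of length I. The endpoint convention below (layer tau belongs to
# location i when t_{i-1} < tau <= t_i) is an illustrative indexing choice.

def beta_matrix(switching_times, T):
    """Return a (T, I) 0/1 matrix whose tau-th row is beta(tau)."""
    I = len(switching_times) - 1              # switching_times = [t_0 = 0, t_1, ..., t_I = T]
    B = np.zeros((T, I), dtype=int)
    for tau in range(1, T + 1):
        for i in range(I):
            if switching_times[i] < tau <= switching_times[i + 1]:
                B[tau - 1, i] = 1
    return B

# Example: T = 6 layers, I = 3 locations, switches after layers 2 and 4.
print(beta_matrix([0, 2, 4, 6], T=6))   # every row sums to 1, as required by B
```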
As mentioned above, the developed switched system approach to the HDNNS makes it possible to replace the control variable $\rho \in Q^{I}$ in system (5) with the equivalent expression $\beta(\cdot)$. Let $y_{s,T}^{\vartheta,\beta}$ be a solution of the switched dynamic system (10) for a concrete selection of the control input

$$(\vartheta, \beta(\cdot)) \in \Theta \times \mathcal{B}.$$

Using the switched system formalism, we now rewrite the main OCP (6) in the following form:

$$\begin{aligned}
&\text{minimize } J(\vartheta, \beta(\cdot))\\
&\text{subject to } y_{s,\tau} = \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y_{s,\tau-1}, \theta_\tau), \quad y_{s,0} \in Y_0,\\
&\qquad\quad \theta_\tau \in \Theta_\tau, \ \beta(\cdot) \in \mathcal{B}, \ q_i \in Q, \ \tau = 1, \ldots, T, \ s = 1, \ldots, S, \ i = 1, \ldots, I. \qquad (11)
\end{aligned}$$

The objective functional $J(\cdot,\cdot)$ in problem (11) can be rewritten as

$$J(\vartheta, \beta(\cdot)) := \frac{1}{S}\sum_{s=1}^{S} \Psi_s\bigl(y_{s,T}^{\vartheta,\beta}\bigr). \qquad (12)$$

Evidently, the objective $J(\vartheta, \beta(\cdot))$ in (12) is a combined functional of both control components.
For the given hybrid MLP network model, the state $y_{s,T}^{\vartheta,\beta}$ represents the final output of the neural network learning process. In accordance with the originally given OCP (6), we next assume that the switched-type problem (11) possesses an optimal solution

$$(\vartheta^{opt}, \beta^{opt}(\cdot)) \in \Theta \times \mathcal{B}.$$

The corresponding optimal trajectory of the switched dynamic system (10) is denoted by $y_{s,\tau}^{opt}$.
The switched-type problem (11) can be considered as a "constructive" version of the initially presented OCP (6): it contains a numerically tractable formalization $\beta(\cdot) \in \mathcal{B}$ of the combinatorial control component $\rho \in Q^{I}$. Similarly to the regularization (9) of the original OCP, we also introduce a regularized version $J_{reg}(\cdot,\cdot)$ of the objective functional $J(\cdot,\cdot)$ of problem (11):

$$J_{reg}(\vartheta, \beta(\cdot)) := J(\vartheta, \beta(\cdot)) + \frac{1}{S}\sum_{s=1}^{S}\sum_{\tau=1}^{T} R_\tau(\theta_\tau). \qquad (13)$$

The selection of a concrete regularization function $R_\tau(\cdot)$ in (13) can be made in the same way as in (9).

4. A Critical Analysis of the Necessity of Optimality Conditions of Pontryagin Type

This section is devoted to the necessary optimality conditions for the switched-type OCP (11). These optimality conditions are given in the form of a Pontryagin Maximum Principle (PMP) for general optimal control processes governed by switched dynamics. We refer to [15,16] for the necessary technical details, examples, and analytical results. An applicability analysis of the PMP for the concrete class of OCPs (11) involving the switched control systems (10) reveals the inconsistency of a possible application of this PMP. As will be shown in this section, the switched systems approach to hybrid feedforward network learning implies the non-effectiveness of the celebrated PMP for problem (6). Since OCP (11) constitutes a mathematically equivalent representation of the hybrid MLP neural network learning problem (6), the same conclusion holds for the initially given OCP (6). This important fact provides the main motivation for the further development of alternative solution schemes for the network learning problem under consideration.
Recall that the celebrated PMP expresses the necessary optimality conditions for many types of conventional, hybrid, and switched-type OCPs. Consider now the switched-type OCP (11) and introduce the corresponding Hamiltonian function:

$$H : \mathbb{N} \times \mathbb{R}^{N_{\tau-1}} \times \mathbb{R}^{N_\tau} \times \Theta_\tau \times \mathcal{B} \to \mathbb{R}, \qquad H(\tau, y, p, \theta, \beta(\cdot)) := \Bigl\langle p, \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y, \theta) \Bigr\rangle_{\mathbb{R}^{N_\tau}}. \qquad (14)$$

By $p \in \mathbb{R}^{N_\tau}$, we denote in (14) the generic adjoint variable, and $\langle \cdot, \cdot \rangle_Z$ is the scalar product in a Euclidean space $Z$. For every location

$$q_i \in Q, \quad i = 1, \ldots, I,$$

of the switched control system (10), we can define the corresponding partial Hamiltonian

$$H^{q_i}(\tau, y, p, \theta, \beta_{[t_{i-1}, t_i)}(\cdot)) := \bigl\langle p, \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y, \theta) \bigr\rangle_{\mathbb{R}^{N_\tau}}.$$

Evidently,

$$H(\tau, y, p, \theta, \beta(\cdot)) = \sum_{i=1}^{I} H^{q_i}(\tau, y, p, \theta, \beta_{[t_{i-1}, t_i)}(\cdot)).$$
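For a numerical reading of (14) and its decomposition, the short sketch below evaluates the Hamiltonian as the $\beta$-weighted sum of the inner products $\langle p, f_\tau^{q_i}(y,\theta)\rangle$; the layer maps in the example are illustrative stand-ins for the transfer functions (4).

```python
import numpy as np

# Sketch of the Hamiltonian (14) and its decomposition into partial Hamiltonians:
# H = sum_i beta_i(tau) * <p, f_tau^{q_i}(y, theta)>. The layer maps f_list below are
# illustrative stand-ins for the transfer functions (4).

def partial_hamiltonians(p, y, theta, beta_row, f_list):
    """Return the values H^{q_i} = <p, beta_i * f_tau^{q_i}(y, theta)> for i = 1, ..., I."""
    return [beta_i * float(p @ f(y, theta)) for beta_i, f in zip(beta_row, f_list)]

def hamiltonian(p, y, theta, beta_row, f_list):
    """H(tau, y, p, theta, beta(.)) as the sum of the partial Hamiltonians."""
    return sum(partial_hamiltonians(p, y, theta, beta_row, f_list))

# Example with two illustrative locations (a tanh layer and an identity layer):
W, b = np.eye(3), np.zeros(3)
f_list = [lambda y, th: np.tanh(th[0] @ y + th[1]),
          lambda y, th: th[0] @ y + th[1]]
print(hamiltonian(np.ones(3), np.array([0.1, -0.2, 0.3]), (W, b), [1, 0], f_list))
```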
Recall that the classical PMP for a discrete-time OCP expresses the necessary optimality condition in terms of a Hamiltonian function. We now present an advanced version of the optimality conditions, namely, the PMP for the concrete discrete-time switched OCP (11).
Theorem 1.
Consider the discrete-time switched OCP (11) and assume that it has an optimal solution

$$(\vartheta^{opt}, \beta^{opt}(\cdot)) \in \Theta \times \mathcal{B}$$

and is Lagrange-regular. Assume that the right-hand side $f_\tau^{q}$ of the HDNNS (5) is given by (4) and that the concrete function $\Psi_s$ in (12) is defined by (8). Let the classifier $C$ be a continuously differentiable function. Assume further that for each $\tau = 1, \ldots, T$ and $y \in \mathbb{R}^{N_{\tau-1}}$, the orientor field

$$\mathcal{F} := \Bigl\{ \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y, \theta) \ : \ (\vartheta, \beta(\cdot)) \in \Theta \times \mathcal{B} \Bigr\}$$

is convex. Then, there exists an adjoint process

$$P_\tau := \{ p_{s,\tau} \in \mathbb{R}^{N_\tau},\ s = 1, \ldots, S \}, \quad \tau = 1, \ldots, T,$$

such that the following relations are satisfied for $s = 1, \ldots, S$ and $\tau = 1, \ldots, T$:

$$\begin{aligned}
y_{s,\tau}^{opt} &= \sum_{i=1}^{I} \nabla_p H^{q_i}\bigl(\tau, y_{s,\tau-1}^{opt}, p_{s,\tau}, \theta_\tau^{opt}, \beta_{[t_{i-1}, t_i)}^{opt}(\cdot)\bigr), \quad y_{s,0}^{opt} \in Y_0,\\
p_{s,\tau-1} &= \sum_{i=1}^{I} \nabla_y H^{q_i}\bigl(\tau, y_{s,\tau-1}^{opt}, p_{s,\tau}, \theta_\tau^{opt}, \beta_{[t_{i-1}, t_i)}^{opt}(\cdot)\bigr),\\
p_{s,T} &= \frac{1}{S}\nabla_y \Psi_s\bigl(y_{s,T}^{opt}\bigr),\\
\sum_{\tau=1}^{T}\sum_{i=1}^{I} H^{q_i}\bigl(\tau, y_{s,\tau-1}^{opt}, p_{s,\tau}, \theta_\tau^{opt}, \beta_{[t_{i-1}, t_i)}^{opt}(\cdot)\bigr) &\le \sum_{\tau=1}^{T}\sum_{i=1}^{I} H^{q_i}\bigl(\tau, y_{s,\tau-1}^{opt}, p_{s,\tau}, \theta, \beta_{[t_{i-1}, t_i)}(\cdot)\bigr) \quad \forall\, (\vartheta, \beta(\cdot)) \in \Theta \times \mathcal{B}. \qquad (16)
\end{aligned}$$
We refer to [11,17], and some references therein, for the formulation and proof of the PMP for OCP (11). The classic PMP for conventional discrete-time OCPs can be found in [16]. Let us note that the derivative

$$\nabla_y \Psi_s\bigl(y_{s,T}^{opt}\bigr)$$

in (16) can be explicitly calculated in the framework of an MLP network; one uses the concrete expression (8) of the function $\Psi_s(\cdot)$ for this purpose. Moreover, in the case of a regularized functional $J_{reg}(\vartheta, \beta(\cdot))$ determined by (13), the corresponding Hamiltonian of the resulting regularized OCP can be defined as follows:

$$H_{reg}(\tau, y, p, \theta, \beta(\cdot)) := H(\tau, y, p, \theta, \beta(\cdot)) + \frac{1}{S} R_\tau(\theta) = \Bigl\langle p, \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y, \theta) \Bigr\rangle_{\mathbb{R}^{N_\tau}} + \frac{1}{S} R_\tau(\theta).$$

Evidently, in this case, the regularization function $R_\tau(\cdot)$ only appears in the last relation (the inequality) of the generic system (16) of optimality conditions.
Versions of the PMP related to concrete OCPs provide a theoretical foundation for an important class of numerical methods in optimal control, namely, the so-called indirect computational methods for OCPs; we refer to [11,12] for an overview. However, the PMP formulated above includes a specific assumption, namely, the convexity of the orientor field $\mathcal{F}$. This convexity assumption constitutes the most crucial condition of Theorem 1. A possible violation of this assumption implies the incorrectness of the PMP optimality conditions for the specific OCP (11). The same is also true with respect to the initially given hybrid MLP neural network learning problem (6). Note that in the case of the regularized OCP with the cost functional $J_{reg}(\vartheta, \beta(\cdot))$ and the correspondingly regularized Hamiltonian $H_{reg}(\tau, y, p, \theta, \beta(\cdot))$, the conditions of Theorem 1 need to be extended by the additional assumption that the regularization function $R_\tau(\cdot)$ is convex (see, e.g., [11]).
It is necessary to stress that the convexity assumption for the orientor field $\mathcal{F}$ is not related to the convexity of the right-hand side of the switched dynamic system (10). We refer to [11,15,16] for various examples of specific convex and non-convex orientor fields.
In the case of MLP trainable layers, the functions

$$f_\tau^{q}(\cdot,\cdot) \in \mathcal{F}$$

have a quasi-affine structure with respect to

$$\theta_\tau \in \Theta_\tau$$

(see (4)). This special case includes many important types of layers, for example, fully connected layers, convolution layers, and batch normalization layers (see, e.g., [6,7]). One can assume that the parameter set $\Theta$ is convex. However, the non-convexity of the set $\mathcal{B}$ implies the non-convexity of the complete (product) control set

$$\Theta \times \mathcal{B},$$

even in the case of convex sets $\Theta_\tau$. This property, together with the nonlinearity of the activation functions $\sigma_\tau^{q}$ and the specific bilinear (non-convex) structure of the summands

$$\beta_{[t_{i-1}, t_i)}(\tau)\, f_\tau^{q_i}(y, \theta)$$

on the right-hand side of (10), excludes the convexity property of the orientor field $\mathcal{F}$ required in Theorem 1. When $\mathcal{F}$ is a non-convex set, it is in general not true that the PMP from Theorem 1 constitutes a necessary optimality condition for the switched-type OCP (11).
We now discuss an additional counter-argument against the use of the PMP (Theorem 1) for the numerical treatment of the switched OCP (11). It is well known that a necessary optimality condition in the form of a PMP provides a numerically consistent algorithm in a full space (for example, a Lebesgue space) of admissible control functions. On the other hand, a direct application of the corresponding PMP to problem (11) does not guarantee admissibility of the numerically obtained optimal solution $(\vartheta^*, \beta^*(\cdot))$ in the sense of problem (11), i.e., it does not guarantee the required condition

$$\beta^*(\cdot) \in \mathcal{B}.$$

The above counter-argument shows that the celebrated PMP and the corresponding computational solution procedures cannot be applied directly to the specific problem (11). The "admissibility problem" mentioned above is a direct consequence of the following pivotal theoretical observation related to a formal mathematical proof of the PMP: the set of possible needle variations associated with the characteristic functions $\beta_{[t_{i-1}, t_i)}(\cdot)$ in problem (11) constitutes a very "poor" set of variations (see, e.g., [15]). As a consequence, one cannot derive a generic adjoint equation that guarantees that the numerically optimal value $\beta^*(\cdot)$ belongs to the set $\mathcal{B}$ of admissible switched controls. A possible numerical application of the optimality system (16), including the adjoint equation, namely, the difference equation for the variable $p_{s,\tau-1}$ in (16), generally results in the inadmissibility $\beta^*(\cdot) \notin \mathcal{B}$ of the resulting numerically optimal variable $\beta^*(\cdot)$. Since $\mathcal{B}$ is a non-convex set, we cannot effectively apply a suitable projection approach to $\beta^*(\cdot)$.
The two main critical arguments discussed above reflect the conceptual applicability problem of the generic PMP in the context of numerical treatments of the OCP (11). This fact significantly restricts the possible numerical application of the presented Theorem 1 and the corresponding indirect solution algorithms. This situation provides a main motivation for the development of novel, direct-solution techniques for OCPs associated with the optimal learning of hybrid fully connected neural networks.

5. A Reduced-Gradient Approach to the Switched-Type OCP for Hybrid Neural Networks

This section presents a novel solution scheme for the switched-type OCP (11) related to the optimal structuring problem for a hybrid MLP neural network. As discussed in Section 4, the possible application of indirect (PMP-based) numerical methods to the switched-type OCP (11) involves some conceptual difficulties. On the other hand, the specific structure of the optimization problem under consideration makes it possible to derive a constructive expression for the so-called "reduced gradient" of the objective functional $J(\cdot,\cdot)$ in OCP (11). The explicit characterization of the reduced gradient discussed in the next theorem will be used as the analytic basis for the specific first-order solution algorithm that we propose. This algorithm can be applied directly to the original switched OCP (11) as well as to the regularized version of the problem.
Consider OCP (11) associated with the given hybrid neural network of MLP type with continuously differentiable functions $\Psi_s(\cdot)$ from (8). Since the concrete functions $f_\tau^{q}(\cdot,\cdot)$ in (3) are also continuously differentiable, we obtain the Fréchet differentiability of the objective functional $J(\cdot,\cdot)$ in OCP (11). The corresponding gradient of $J(\cdot,\cdot)$ is denoted by $\nabla J(\cdot,\cdot)$.
We now apply the generalized reduced-gradient method that is comprehensively discussed in [11,12] to the concrete OCP (11) and obtain the following formal results.
Theorem 2.
Consider the switched OCP (11) and assume that all conditions of Theorem 1 are satisfied. The reduced gradient $\nabla J(\cdot,\cdot)$ of the objective functional $J(\cdot,\cdot)$ in problem (11) can be computed as a solution of the following system of equations:

$$\begin{aligned}
y_{s,\tau} &= \nabla_p H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr), \quad y_{s,0} \in Y_0,\\
p_{s,\tau-1} &= \nabla_y H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr),\\
p_{s,T} &= \frac{1}{S}\nabla \Psi_s\bigl(y_{s,T}\bigr),\\
\nabla J(\theta, \beta(\cdot)) &= \begin{pmatrix} \nabla_\theta H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \\ \nabla_\beta H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \end{pmatrix},\\
&\qquad \tau = 1, \ldots, T, \ s = 1, \ldots, S, \ i = 1, \ldots, I, \ q_i \in Q, \qquad (17)
\end{aligned}$$

where $p_{s,\tau} \in \mathbb{R}^{N_\tau}$, $\tau = 1, \ldots, T$, $s = 1, \ldots, S$, is the adjoint variable.
By $\nabla_\theta H$ and $\nabla_\beta H$, we denote here the partial derivatives of $H$ with respect to $\theta$ and $\beta$, respectively. A complete proof of Theorem 2 in the case of a fixed $s = 1, \ldots, S$ can be found in [11,12]. Similar results for concrete classes of discrete-time and continuous-time OCPs are obtained in [17].
Using expression (14) for the system Hamiltonian $H(\tau, y, p, \theta, \beta(\cdot))$, we can calculate the components $\nabla_\theta H$ and $\nabla_\beta H$ of the gradient vector $\nabla J(\theta, \beta)$ in (17):

$$\nabla_{\theta}^{T} H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) = \Bigl\langle p_{s,\tau}, \sum_{i=1}^{I} \beta_{[t_{i-1}, t_i)}(\tau)\, \nabla_\theta f_\tau^{q_i}(y_{s,\tau-1}, \theta_\tau) \Bigr\rangle_{\mathbb{R}^{N_\tau}}, \qquad (18)$$

and

$$\nabla_{\beta}^{T} H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) = \Bigl( \bigl\langle p_{s,\tau}, f_\tau^{q_1}(y_{s,\tau-1}, \theta_\tau) \bigr\rangle_{\mathbb{R}^{N_\tau}}, \ \ldots, \ \bigl\langle p_{s,\tau}, f_\tau^{q_I}(y_{s,\tau-1}, \theta_\tau) \bigr\rangle_{\mathbb{R}^{N_\tau}} \Bigr).$$

By $\nabla_{\theta}^{T}$ and $\nabla_{\beta}^{T}$, we denote here the transposed gradients. Let us note that a detailed calculation of the partial derivative

$$\nabla_{\theta}^{T} H\bigl(\tau, y, p, \theta, \beta(\cdot)\bigr)$$

depends on the concrete selection of the activation function $\sigma_\tau^{q}$ in (4).
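For a single sample and a fixed activation choice per layer, the recursion (17) together with (14) reduces to an ordinary forward/adjoint (backpropagation) sweep. The sketch below is an illustrative instance for the layer form (4); the activation set, the storage convention for $W_\tau$, and the function grad_psi are our assumptions, and the $\beta$-component of the gradient (the inner products $\langle p, f_\tau^{q_i}\rangle$) is omitted for brevity.

```python
import numpy as np

# Illustrative single-sample instance of the reduced-gradient recursion (17) for
# layers of the form (4), y_tau = sigma_q(W_tau y_{tau-1} + b_tau). W_tau is stored
# as an (N_tau, N_{tau-1}) array so that W @ y conforms; grad_psi is the gradient of
# Psi_s at the final state. The beta-component of the gradient is omitted here.

def act(q, z):
    """Return sigma_q(z) and its element-wise derivative sigma_q'(z)."""
    if q == "relu":
        return np.maximum(z, 0.0), (z > 0).astype(float)
    if q == "tanh":
        t = np.tanh(z)
        return t, 1.0 - t ** 2
    return z, np.ones_like(z)                       # identity layer

def reduced_gradient(y0, thetas, qs, grad_psi, S):
    # forward sweep (state equation in (17))
    ys, ds = [y0], []
    for (W, b), q in zip(thetas, qs):
        y, d = act(q, W @ ys[-1] + b)
        ys.append(y); ds.append(d)
    # adjoint sweep (costate equation and terminal condition in (17))
    p = grad_psi(ys[-1]) / S                        # p_{s,T} = (1/S) grad Psi_s(y_{s,T})
    grads = [None] * len(thetas)
    for tau in reversed(range(len(thetas))):
        W, _ = thetas[tau]
        g = ds[tau] * p                             # sigma'(z_tau) elementwise times p_{s,tau}
        grads[tau] = (np.outer(g, ys[tau]), g)      # (dJ/dW_tau, dJ/db_tau), cf. (18)
        p = W.T @ g                                 # p_{s,tau-1} = grad_y H
    return grads
```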
Consider now the regularized functional $J_{reg}(\cdot,\cdot)$ from (13) and the corresponding regularized version of OCP (11). We assume that the regularization functions

$$R_\tau(\cdot), \quad \tau = 1, \ldots, T,$$

are continuously differentiable. In this case, the Hamiltonian $H(\tau, y, p, \theta, \beta(\cdot))$ introduced in (14) needs to be replaced by its regularized version introduced in the previous section,

$$H_{reg}(\tau, y, p, \theta, \beta(\cdot)).$$

With this replacement, the reduced gradient $\nabla J_{reg}(\cdot,\cdot)$ of the regularized OCP can be found from the corresponding extension of the basic system of Equation (17) from Theorem 2. Let us formulate the corresponding result for the regularized OCP (11).
Theorem 3.
Consider the regularized version of the switched OCP (11) and assume that all conditions of Theorem 1 are satisfied. The reduced gradient $\nabla J_{reg}(\cdot,\cdot)$ of the objective functional $J_{reg}(\cdot,\cdot)$ can be computed as a solution of the following system of equations:

$$\begin{aligned}
y_{s,\tau} &= \nabla_p H_{reg}\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr), \quad y_{s,0} \in Y_0,\\
p_{s,\tau-1} &= \nabla_y H_{reg}\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr),\\
p_{s,T} &= \frac{1}{S}\nabla \Psi_s\bigl(y_{s,T}\bigr),\\
\nabla J_{reg}(\theta, \beta(\cdot)) &= \begin{pmatrix} \nabla_\theta H_{reg}\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \\ \nabla_\beta H_{reg}\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \end{pmatrix}
= \begin{pmatrix} \frac{1}{S}\nabla R_\tau(\theta_\tau) \\ 0_{\mathbb{R}^{I}} \end{pmatrix} + \begin{pmatrix} \nabla_\theta H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \\ \nabla_\beta H\bigl(\tau, y_{s,\tau-1}, p_{s,\tau}, \theta_\tau, \beta(\tau)\bigr) \end{pmatrix},\\
&\qquad \tau = 1, \ldots, T, \ s = 1, \ldots, S, \ i = 1, \ldots, I, \ q_i \in Q,
\end{aligned}$$

where $0_{\mathbb{R}^{I}}$ denotes the zero vector in the Euclidean space $\mathbb{R}^{I}$.
The results of Theorems 2 and 3, namely, the constructive expressions for the reduced gradients in the originally given OCP (11) and in its regularized version, make it possible to consider various first-order gradient-based computational techniques for a numerical treatment of the switched OCP (11). Let us present here the generic reduced-gradient algorithm for problem (11):

$$\begin{aligned}
\text{initial step:} \quad & \begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(0)} \in \Theta \times \mathcal{B};\\
\text{iterations:} \quad & \begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(k+1)} = \begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(k)} - \gamma_k \nabla J\bigl(\theta^{(k)}, \beta^{(k)}(\cdot)\bigr), \quad k = 0, 1, \ldots;\\
\text{inclusion condition:} \quad & \begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(k+1)} \in \Theta \times \mathcal{B}. \qquad (19)
\end{aligned}$$
Here, $k = 0, 1, \ldots$ is the iteration index and $\gamma_k > 0$ is the step size. Clearly, for the objective functional $J_{reg}(\cdot,\cdot)$ from the regularized version of problem (11), the gradient in (19) needs to be replaced by the constructive expression of

$$\nabla J_{reg}\bigl(\theta^{(k)}, \beta^{(k)}(\cdot)\bigr)$$

from Theorem 3.
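Structurally, one pass of (19) can be sketched as the projected-gradient update below; the step size rule and the placeholder projection operators enforcing the inclusion condition are illustrative choices (a concrete projection for the $\beta$-component is discussed at the end of this section).

```python
import numpy as np

# Sketch of one iteration of the reduced-gradient method (19). The step size gamma
# and the projection operators used to enforce the inclusion condition are
# illustrative placeholders; see the relaxation discussion at the end of the section
# for one way to project the beta component.

def step(thetas, beta, grads_theta, grad_beta, gamma, proj_theta, proj_beta):
    new_thetas = [proj_theta((W - gamma * gW, b - gamma * gb))
                  for (W, b), (gW, gb) in zip(thetas, grads_theta)]
    new_beta = proj_beta(beta - gamma * grad_beta)
    return new_thetas, new_beta
```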
We now examine the convergence behaviour of the basic gradient method (19). Recall that the convergence properties of the generic reduced-gradient method for switched system optimization have been comprehensively discussed by many authors (see, e.g., [11,17,18] and references therein). Concretely, we can prove the following convergence result for the reduced-gradient algorithm (19).
Theorem 4.
Consider OCP (11) and assume that all conditions of Theorem 1 are satisfied. Let

$$\left\{ \begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(k)} \right\}_{k = 0, 1, \ldots} \qquad (20)$$

be a sequence generated by method (19). Then, for an admissible initial point

$$\begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(0)} \in \Theta \times \mathcal{B},$$

the sequence (20) is a minimizing sequence for problem (11), i.e.,

$$\lim_{k \to \infty} J\bigl(\vartheta^{(k)}, \beta^{(k)}(\cdot)\bigr) = J\bigl(\vartheta^{opt}, \beta^{opt}(\cdot)\bigr).$$

Moreover, this sequence converges weakly to a solution $(\vartheta^{opt}, \beta^{opt}(\cdot))$ of the switched-type OCP (11).
Proof. 
The property that sequence (20) is a minimizing sequence for problem (11) follows immediately from [9]. Moreover, the concrete functions

$$f_\tau^{q}(y_{s,\tau-1}, \theta_\tau) \in \mathcal{F},$$

where $\mathcal{F}$ is defined in (3), are Lipschitz continuous and possess Lipschitz continuous derivatives for all $q \in Q$. Since the composition of two Lipschitz continuous mappings is itself Lipschitz continuous, we obtain the Lipschitz continuity of the gradient $\nabla J(\theta, \beta(\cdot))$. The weak convergence of the sequence (20) generated by method (19) to the optimal pair

$$(\vartheta^{opt}, \beta^{opt}(\cdot))$$

now follows from [19]. The proof is complete. □
Let us make some observations related to Theorem 4. Since $\vartheta \in \Theta$ and $\Theta$ is a subset of a finite-dimensional Euclidean space (see Section 2), the weak convergence of the first component $\vartheta^{(k)}$ in (20) to $\vartheta^{opt}$ coincides with the norm convergence in this Euclidean space. The weak convergence of the second component $\beta^{(k)}(\cdot)$ in (20) to $\beta^{opt}(\cdot)$ is, in fact, weak convergence in the Hilbert space of all square-integrable functions of the corresponding dimensionality.
Note that the inclusion condition

$$\begin{pmatrix} \theta \\ \beta(\cdot) \end{pmatrix}^{(k+1)} \in \Theta \times \mathcal{B}$$

for the updated iterates in (19) can be implemented using a projection. Assume that the set $\Theta$ is convex. In that case, one needs to consider the following convexification of the product set $\Theta \times \mathcal{B}$:

$$\Theta \times \mathrm{conv}\{\mathcal{B}\},$$

where $\mathrm{conv}\{\mathcal{B}\}$ is the convex hull of the function set $\mathcal{B}$. We refer to [11,20] for further theoretical and computational details related to relaxation (convexification) procedures for gradient methods. A sketch of one possible projection step onto this convexified set is given below.
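One common way to realize the inclusion step over $\Theta \times \mathrm{conv}\{\mathcal{B}\}$ is to project each relaxed row $\beta(\tau)$ onto the probability simplex, which is the convex hull of the one-hot vectors constituting $\mathcal{B}$. The sketch below uses the standard Euclidean simplex projection as an illustrative choice; the text itself does not prescribe a specific projection.

```python
import numpy as np

# Sketch of a row-wise Euclidean projection onto conv{B}: every relaxed vector
# beta(tau) is mapped to the probability simplex {x : x >= 0, sum x = 1}, i.e. the
# convex hull of the one-hot vectors in B. This particular projection is an
# illustrative choice, not one prescribed by the text.

def project_simplex(v):
    u = np.sort(v)[::-1]                               # sort in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    lam = css[rho] / (rho + 1.0)
    return np.maximum(v - lam, 0.0)

def project_conv_B(beta):
    """Apply the simplex projection to every row beta(tau) of the relaxed switching matrix."""
    return np.vstack([project_simplex(row) for row in beta])
```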
Finally, note that the proposed reduced-gradient algorithm provides an adequate alternative to the PMP-based optimality conditions critically discussed in Section 4. It constitutes a novel, OCP-based approach to the optimal structuring problem for a hybrid MLP neural network.

6. Concluding Remarks

In this paper, we developed a novel, optimal control-based approach to the optimization problem associated with the training of a feedforward fully connected hybrid neural network. We used a control-theoretic approach and reformulated the main optimal structuring problem for a hybrid MLP neural network in the form of a specific switched-type OCP. This equivalent reformulation makes it possible to apply advanced optimal control methodology to the training and structural optimization of a given hybrid MLP network. To be precise, we considered the optimal control of discrete-time switched systems and developed a reduced-gradient method for the practical treatment of the resulting switched OCP.
Let us note that the proposed equivalent representation of the trainable network design has the potential to be used in the optimal training of convolutional neural networks (CNNs). This optimistic conclusion follows from the general structure of a CNN (see, e.g., [21,22]) and from the existence of many well-developed theoretical results and computational methods for optimal control.
The equivalent reduction of a given sophisticated fully connected neural network design problem to a semiclassical OCP shows the effectiveness of the proposed optimal control methodology for neural network learning.
This is a preliminary proposal for optimizing the connection weights of sequential layers of computing nodes to accomplish a mapping from a domain to a range, which is the task performed by a neural network. The model is capable of accommodating different activation functions at different layers, which prompted us to call it a hybrid (layer) neural network. This work constitutes initial theoretical research on the subject. The advanced optimal control approach discussed here is a complementary technique in the context of existing neural network training algorithms. To establish our approach as a viable alternative to established learning algorithms, we need to prototype and simulate the model. Comprehensive simulation studies for performance analysis and comparison with traditional learning methods, both in terms of computational efficiency and quality of results, are needed; this is our future work.

Author Contributions

Conceptualization, G.C. and V.A.; Methodology, G.C.; Validation, L.A.G.T.; Formal analysis, V.A.; Writing—original draft, G.C.; Writing—review & editing, V.A., L.A.G.T. and G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
$Y_0$ — the set of inputs of the supervised data; 0 is the label for the input
$N_0$ — the dimension of the input data
$Y_T$ — the set of outputs of the supervised data; $T$ is the label for the output
$N_T$ — the dimension of the output data
$T$ — $T \in \mathbb{N}$; the integer $T$ is the number of layers, as well as the index of the output layer and of the output data label
$S$ — the number of training data; $s$ is used as the index of the training data
$y_{s,0}$ — $y_{s,0} \in \mathbb{R}^{N_0}$, $s = 1, \ldots, S$, is the $s$-th input datum
$y_{s,T}$ — $y_{s,T} \in \mathbb{R}^{N_T}$, $s = 1, \ldots, S$, is the $s$-th output datum
$\tau$ — index for the network layer number
$N_{\tau-1}$ — number of input nodes (dimension of the input) of the $\tau$-th layer
$N_\tau$ — number of output nodes (dimension of the output) of the $\tau$-th layer
$\theta_\tau$ — $\theta_\tau := (W_\tau, b_\tau)$, $\tau = 1, \ldots, T$, are the trainable parameters of the $\tau$-th network layer
$W_\tau$ — the connection weights between the input and output nodes of the $\tau$-th layer
$b_\tau$ — the connection weights of the bias node of the $\tau$-th layer
$Q$ — a finite index set for a family of predefined computational models; this allows the network architecture to accommodate hybrid layers
$q_i \in Q$ — $q_i$ is a specific $i$-th option from $Q = \{q_1, \ldots, q_i, \ldots, q_I\}$, $I \le T$, i.e., there could be layers with the same computational model
$\phi_\tau^{q_i}$ — the transfer function of the $\tau$-th layer using $q_i \in Q$
$\phi_{\tau+1}^{q} \circ \phi_\tau^{q}$ — the combined transfer function of two consecutive network layers; layer $\tau$'s output is the input to layer $\tau+1$
$\Theta_\tau$ — $\Theta_\tau \subseteq \mathbb{R}^{N_{\tau-1} \times N_\tau} \times \mathbb{R}^{N_\tau}$ is the set of admissible parameters $\theta_\tau$
$\phi_\tau^{q}$ — the $\tau$-th layer transfer operator, $\phi_\tau^{q} : \mathbb{R}^{N_{\tau-1}} \times \Theta_\tau \to \mathbb{R}^{N_\tau}$
$f_\tau^{q}$ — $f_\tau^{q} : \mathbb{R}^{N_{\tau-1}} \times \Theta_\tau \to \mathbb{R}^{N_\tau}$, the transfer function from the input to the output vector
$\sigma_\tau^{q}$ — the activation of type $q \in Q$; thus, $f_\tau^{q}(y_{s,\tau-1}, \theta_\tau) = \sigma_\tau^{q}(W_\tau y_{s,\tau-1} + b_\tau)$
$\vartheta, \rho$ — the collected variables for the $\theta_\tau$ and the $q_i$, respectively
$\tilde{J}$ — the objective functional for the optimization expressed in Equation (6)

References

  1. Rumelhart, D.E.; McClelland, J.L. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1987. [Google Scholar]
  2. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  3. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
  4. Zhong, S.; Cherkassky, V. Factors controlling generalization ability of MLP networks. In Proceedings of the IJCNN’99 (International Joint Conference on Neural Networks), Washington, DC, USA, 10–16 July 1999. [Google Scholar]
  5. Chakraborty, G.; Murakami, M.; Shiratori, N.; Noguchi, S. A growing network that optimizes between undertraining and overtraining. In Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 2, pp. 1116–1121. [Google Scholar]
  6. Benning, M.; Celledoni, E.; Ehrhardt, M.J.; Owren, B.; Schoenlieb, C.-B. Deep learning as optimal control problems: Models and numerical methods. J. Comput. Dyn. 2019, 6, 171–198. [Google Scholar] [CrossRef]
  7. Li, Q.; Hao, S. An optimal control approach to deep learning and applications to discrete-weight neural networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  8. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  9. Bertsekas, D.P. Nonlinear Programming; Athena Scientific: Belmont, MA, USA, 1999. [Google Scholar]
  10. Galván-Guerra, R.; Azhmyakov, V.; Egerstedt, M. Optimization of multiagent systems with increasing state dimensions: Hybrid LQ approach. In Proceedings of the 2011 American Control Conference, San Francisco, CA, USA, 29 June–1 July 2011; pp. 881–887. [Google Scholar]
  11. Azhmyakov, V. A Relaxation Based Approach to Optimal Control of Switched Systems; Elsevier: Oxford, UK, 2019. [Google Scholar]
  12. Azhmyakov, V.; Basin, M.; Raisch, J. A proximal point based approach to optimal control of affine switched systems. Discret. Event Dyn. Syst. 2012, 22, 61–81. [Google Scholar] [CrossRef]
  13. Atlee Jackson, E. On the control of complex dynamic systems. Physica D 1991, 50, 341–366. [Google Scholar]
  14. Egerstedt, M.; Wardi, Y.; Axelsson, H. Transition-time optimization for switched systems. IEEE Trans. Autom. Control 2006, AC-51, 110–115. [Google Scholar]
  15. Boltyanski, V.; Poznyak, A. The Robust Maximum Principle; Birkhauser: New York, NY, USA, 2012. [Google Scholar]
  16. Halkin, H. A Maximum Principle of the Pontryagin type for systems described by nonlinear difference equations. SIAM J. Control 1966, 4, 90–111. [Google Scholar] [CrossRef]
  17. Teo, K.L.; Goh, C.J.; Wong, K.H. A Unified Computational Approach to Optimal Control Problems; Wiley: New York, NY, USA, 1991. [Google Scholar]
  18. Polak, E. Optimization; Springer: New York, NY, USA, 1997. [Google Scholar]
  19. Goldstein, A.A. Convex programming in Hilbert space. Bull. Am. Math. Soc. 1964, 70, 709–710. [Google Scholar] [CrossRef]
  20. Roubicek, T. Relaxation in Optimization Theory and Variational Calculus; De Gruyter: Berlin, Germany, 1997. [Google Scholar]
  21. Saleh, A.M.; Hamoud, T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J. Big Data 2021, 8, 1. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, J.; Lin, J.; Wang, Z. Efficient convolution architectures for convolutional neural network. In Proceedings of the 2016 8th International Conference on Wireless Communications and Signal Processing (WCSP), Yangzhou, China, 13–15 October 2016; pp. 1–5. [Google Scholar]