Electronics
  • Article
  • Open Access

12 September 2023

Analysis of Function Approximation and Stability of General DNNs in Directed Acyclic Graphs Using Un-Rectifying Analysis

Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
Author to whom correspondence should be addressed.
This article belongs to the Special Issue New Insights and Techniques for Neural Networks

Abstract

A general lack of understanding pertaining to deep feedforward neural networks (DNNs) can be attributed partly to a lack of tools with which to analyze the composition of non-linear functions, and partly to a lack of mathematical models applicable to the diversity of DNN architectures. In this study, we analyze DNNs using directed acyclic graphs (DAGs) under a number of basic assumptions pertaining to activation functions, non-linear transformations, and DNN architectures. DNNs that satisfy these assumptions are referred to as general DNNs. Our construction of an analytic graph was based on an axiomatic method in which DAGs are built from the bottom–up through the application of atomic operations to basic elements in accordance with regulatory rules. This approach allowed us to derive the properties of general DNNs via mathematical induction. We demonstrate that the proposed analysis method enables the derivation of some properties that hold true for all general DNNs, namely that DNNs “divide up” the input space, “conquer” each partition using a simple approximating function, and “sparsify” the weight coefficients to enhance robustness against input perturbations. This analysis provides a systematic approach with which to gain theoretical insights into a wide range of complex DNN architectures.

1. Introduction

Deep feedforward neural networks (DNNs) have revolutionized the use of machine learning in many fields, such as computer vision and signal processing, where they have been used to resolve ill-posed inverse problems and sparse recovery problems [1,2]. Much of the previous research in the deep learning literature describes the construction of neural networks capable of attaining a desired level of performance for a given task. However, researchers have yet to elucidate several fundamental issues that are critical to the function of DNNs. Predictions based on non-explainable and non-interpretable models raise trust issues pertaining to the deployment of those neural networks in practical applications [3]. This lack of understanding can be attributed, at least partially, to a lack of tools with which to analyze the composition of non-linear activation functions in DNNs, as well as a lack of mathematical models applicable to the diversity of DNN architectures. This paper reports on a preliminary study of fundamental issues pertaining to function approximation and the inherent stability of DNNs.
Simple series-connected DNN models, such as $N_L^s(\mathbf{x}) = \varrho_L M_L \cdots \varrho_1 M_1(\mathbf{x})$, have been widely adopted for analysis [4,5,6]. According to this model, the input domain is partitioned into a collection of polytopes in a tree-like manner. Each node of the tree can be associated with a polytope and one affine linear mapping, which gives a local approximation of an unknown target function with a domain restriction on the polytope [7,8]. We initially considered whether the theoretical results derived in those papers were intrinsic/common to all DNNs or unique to this type of network. However, our exploration of this issue was hindered by problems encountered in representing a large class of DNN architectures and in formulating the computation spaces for activation functions and non-linear transformations. In our current work, we addressed the latter problem by assuming that all activation functions can be expressed as networks with point-wise continuous piecewise linear (CPWL) activation functions and that all non-linear transforms are Lipschitz functions. The difficulties involved in covering all possible DNNs prompted us to address the former problem by associating DNNs with graphs that can be described in a bottom–up manner using an axiomatic approach, thereby allowing for analysis of each step in the construction process. This approach made it possible for us to build complex networks from simple ones and derive their intrinsic properties via mathematical induction.
In the current study, we sought to avoid the generation of graphs with loops by describing DNNs using directed acyclic graphs (DAGs). The arcs are associated with basic elements that correspond to operations applied to the layers of a DNN (e.g., linear matrix, affine linear matrix, non-linear activation function/transformation), while nodes that delineate basic elements are used to relay and reshape the dimension of an input or combine outputs from incoming arcs to outgoing arcs. We refer to DNNs that can be constructed using the proposed axiomatic approach as general DNNs. It is unclear whether general DNNs are equivalent to all DNNs that are expressible using DAGs. Nevertheless, general DNNs include modules widely employed in well-known DNN architectures. The proposed approach makes it possible to extend the theoretical results for series-connected DNNs to general DNNs, as follows:
  • A DNN DAG divides the input space via partition refinement using either a composition of activation functions along a path or a fusion operation combining inputs from more than one path in the graph. This makes it possible to approximate a target function in a coarse-to-fine manner by applying a local approximating function to each partitioning region in the input space. Furthermore, the fusion operation means that domain partition tends not to be a tree-like process.
  • Under mild assumptions related to point-wise CPWL activation functions and non-linear transformations, the stability of a DNN against local input perturbations can be maintained using sparse/compressible weight coefficients associated with incident arcs to a node.
Accordingly, we can conclude that a general DNN “divides” the input space, “conquers” the target function by applying a simple approximating function over each partition region, and “sparsifies” weight coefficients to enhance robustness against input perturbations.
In the literature, graphs are commonly used to elucidate the structure of DNNs; however, they are seldom used to further the analysis of DNNs. Both graph DNNs [9] and the proposed approach adopt graphs for analysis; however, graph DNNs focus on the operations of neural networks in order to represent real-world datasets in graphs (e.g., social networks and molecular structure [10]), whereas our approach focuses on the construction of analyzable graph representations by which to deduce intrinsic properties of DNNs.
The remainder of this paper is organized as follows. In Section 2, we present a review of related works. Section 3 outlines our bottom–up axiomatic approach to the construction of DNNs. Section 4 outlines the function approximation and stability of general DNNs. Concluding remarks are presented in Section 5.
Notation 1.
Matrices are denoted using bold upper case, and vectors are denoted using bold lower case. We also use $x_i$ to denote the i-th entry of a vector $\mathbf{x} \in \mathbb{R}^n$, $\|\mathbf{x}\|_2$ to denote its Euclidean norm, and $\mathrm{diag}(\mathbf{x})$ to denote a diagonal matrix with diagonal $\mathbf{x}$.

3. DNNs and DAG Representations

The class of DNNs addressed in this study is defined by specific activation functions, non-linear transformations, and underlying architectures. Note that legitimate activation functions, non-linear transformations, and architectures should be analyzable while providing sufficient generality to cover the DNNs in common use.
Activation functions and non-linear transformations are both considered functions; however, we differentiate between them due to differences in the way they are treated under un-rectifying analysis. A non-linear transformation is a function in the conventional sense when mapping $\mathbb{R}^n$ to $\mathbb{R}^m$, in which different inputs are evaluated using the same function. This differs from activation functions, in which different inputs can be associated with different functions. For example, for $\mathrm{ReLU}: \mathbb{R} \to \mathbb{R}$, un-rectifying considers $\mathrm{ReLU}(x) = dx$ with $d \in \{0, 1\}$ as two functions, depending on whether $x > 0$ (where $d = 1$) or $x \le 0$ (where $d = 0$).
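The following minimal NumPy sketch (my own illustration, not from the paper) makes the un-rectifying view concrete: for a given input, the ReLU activation is replaced by an input-dependent diagonal matrix $D_{\mathbf{x}}$ that acts linearly on that input.

```python
import numpy as np

def unrectify_relu(x):
    """Return the input-dependent diagonal matrix D_x such that ReLU(x) = D_x x."""
    d = (x > 0).astype(float)          # d_i = 1 where x_i > 0, otherwise 0
    return np.diag(d)

x = np.array([1.5, -0.3, 0.0, 2.0])
D_x = unrectify_relu(x)
assert np.allclose(D_x @ x, np.maximum(x, 0.0))   # D_x x coincides with ReLU(x)
```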

3.1. Activation Functions and Non-Linear Transformations

In this paper, we focus on activation functions that can be expressed as networks of point-wise CPWL activation functions ($\rho$). Based on this assumption and the following lemma, we assert that the activation functions of concern are ReLU networks.
Lemma 1
([24]). Any point-wise CPWL activation function $\rho: \mathbb{R} \to \mathbb{R}$ of m pieces can be expressed as follows:
$$\rho(x) = \sum_{i=1}^{m} \big[ r_i\,\mathrm{ReLU}(x - a_i) + l_i\,\mathrm{ReLU}(t_i - x) \big] = \sum_{i \in I^{+}} r_i\,\mathrm{ReLU}(x - a_i) + \sum_{i \in I^{-}} l_i\,\mathrm{ReLU}(t_i - x), \qquad (16)$$
where $l_i$ and $r_i$ indicate the slopes of segments and $a_i$ and $t_i$ are breakpoints of the corresponding segments.
Note that the max-pooling operation (arguably the most popular non-linear pooling operation), which outputs the largest value in the block and maintains the selected location [25], is also a ReLU network. The max pooling of a block of any size can be derived recursively from max pooling over a block of size 2. For example, let $\max_4$ and $\max_2$ denote the max pooling of blocks of sizes 4 and 2, respectively. Then, $\max_4(x_1, x_2, x_3, x_4) = \max_2(\max_2(x_1, x_2), \max_2(x_3, x_4))$ and $\max_5(x_1, x_2, x_3, x_4, x_5) = \max_2(\max_4(x_1, x_2, x_3, x_4), x_5)$. The max pooling of a block of size 2 can be expressed as follows:
$$\max_2(\mathbf{x} = [x_1\;\, x_2]^\top) = \frac{x_1 + x_2}{2} + \frac{|x_1 - x_2|}{2} = \frac{1}{2}\,[1\;\, 1\;\, 1]\;\tilde\rho \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix} \mathbf{x},$$
where $\tilde\rho \in \mathbb{R}^{3 \times 3}$ is
$$\tilde\rho = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \mathrm{ReLU} & 0 \\ 0 & 0 & \mathrm{ReLU} \end{bmatrix}.$$
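As a quick check of the expression above, the following sketch (my own illustration, assuming the sign pattern in the reconstructed matrix) implements $\max_2$ as a small ReLU network, where $\tilde\rho$ applies ReLU to the second and third coordinates only.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def max2_relu_net(x1, x2):
    """max(x1, x2) computed as (x1 + x2)/2 + |x1 - x2|/2 with a ReLU network."""
    M = np.array([[1.0, 1.0],
                  [1.0, -1.0],
                  [-1.0, 1.0]])
    z = M @ np.array([x1, x2])
    z[1:] = relu(z[1:])                # rho~ = diag(1, ReLU, ReLU)
    return 0.5 * np.ones(3) @ z

for a, b in [(3.0, -1.0), (-2.0, 5.0), (0.5, 0.5)]:
    assert np.isclose(max2_relu_net(a, b), max(a, b))
```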
In this paper, we make two assumptions pertaining to activation function ρ :
(A1)
$\rho: \mathbb{R} \to \mathbb{R}$ can be expressed as (16).
This assumption guarantees that for any input $\mathbf{x}$, the layer $\rho: \mathbb{R}^l \to \mathbb{R}^l$ can be associated with a diagonal un-rectifying matrix $D_{\mathbf{x}}^{\rho}$ with real-valued diagonal entries, where the value of the $l$-th diagonal entry is $\sum_{i \in I_l^{+}} r_{l,i} + \sum_{j \in I_l^{-}} l_{l,j}$, while $I_l^{+}$ and $I_l^{-}$ denote the sets in which the corresponding ReLUs are active.
(A2)
There exists a bound $d_\rho > 0$ for any activation function $\rho$, regardless of input $\mathbf{x}$. This corresponds to the assumption that
$$\|D_{\mathbf{x}}^{\rho}\|_2 \le \max_l \Big( \sum_{i \in I_l^{+}} |r_{l,i}| + \sum_{j \in I_l^{-}} |l_{l,j}| \Big) \le d_\rho. \qquad (18)$$
This assumption is used to establish the stability of a network against input perturbations.
The outputs of a non-linear transformation layer can be interpreted as coefficient vectors related to the domain of the transformation. We make the following assumption pertaining to the non-linear transformations ($\sigma$) addressed in the current paper.
(A3)
There exists a uniform Lipschitz constant bound $d_\sigma > 0$ with respect to the $\ell_2$ norm for any non-linear transformation function $\sigma$ and any inputs $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{R}^n$:
$$\|\sigma(\mathbf{x}) - \sigma(\mathbf{y})\|_2 \le d_\sigma\, \|\mathbf{x} - \mathbf{y}\|_2. \qquad (19)$$
This assumption is used to establish the stability of a network against input perturbations. Sigmoid and tanh functions are 1-Lipschitz [26]. The softmax layer from $\mathbb{R}^n$ to $\mathbb{R}^n$ is defined as $x_i \mapsto \frac{\exp(\lambda x_i)}{\sum_{j=1}^{n} \exp(\lambda x_j)}$ for $i = 1, \ldots, n$, where $\lambda$ is the inverse temperature constant. The output is the estimated probability distribution of the input vector over the simplex in $\mathbb{R}^n$. The softmax function $\mathrm{Softmax}_\lambda$ is $\lambda$-Lipschitz [27], as follows:
$$\|\mathrm{Softmax}_\lambda(\mathbf{x}) - \mathrm{Softmax}_\lambda(\mathbf{y})\|_2 \le \lambda\, \|\mathbf{x} - \mathbf{y}\|_2.$$
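A small numeric check of this bound (my own illustration; the dimension, $\lambda$, and sample count are arbitrary choices):

```python
import numpy as np

def softmax_lam(x, lam=1.0):
    z = lam * x
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
lam = 2.0
for _ in range(1000):
    x, y = rng.normal(size=8), rng.normal(size=8)
    lhs = np.linalg.norm(softmax_lam(x, lam) - softmax_lam(y, lam))
    assert lhs <= lam * np.linalg.norm(x - y) + 1e-12   # lambda-Lipschitz bound
```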

3.2. Proposed Axiomatic Method

Let $\mathcal{K}$ denote the class of DNNs that can be constructed using the following axiomatic method with activation functions that satisfy (A1) and (A2) and non-linear transformations that satisfy (A3). This axiomatic method employs three atomic operations (O1–O3), the basis set $\mathcal{B}$, and a regulatory rule (R) describing a legitimate method by which to apply an atomic operation to elements in $\mathcal{K}$ in order to yield another element in $\mathcal{K}$.
Basis set $\mathcal{B}$ comprises the following operations:
$$\mathcal{B} = \{ I, L, M, \Gamma_\rho, \rho M, \Gamma_\sigma, \sigma M \},$$
where $I$ denotes the identity operation; $L$ denotes any finite-dimensional linear mapping with a bounded spectral norm; $M = (L, \mathbf{b})$ denotes any affine linear mapping, where $L$ and $\mathbf{b}$ refer to the linear and bias terms, respectively; $\Gamma_\rho$ denotes activation functions satisfying (A1) and (A2); $\rho M$ denotes functions with $\rho \in \Gamma_\rho$; $\Gamma_\sigma$ denotes non-linear transformations satisfying (A3); and $\sigma M$ denotes functions with $\sigma \in \Gamma_\sigma$.
Assumptions pertaining to Γ ρ and Γ σ are combined to obtain the following:
(A) The assumption of uniform bounding is based on the existence of a uniform bound $\infty > d > 0$ for any activation function $\rho \in \Gamma_\rho$, any input $\mathbf{x}$, and any non-linear transformation $\sigma \in \Gamma_\sigma$, such that
$$d \ge \max\{ d_\rho, d_\sigma \}. \qquad (21)$$
Let $\chi$ denote the input space for any element in $\mathcal{B}$. The results of the following atomic operations belong to $\mathcal{K}$. The corresponding DAG representations are depicted in Figure 3 (a reshaping of input or output vectors is implicitly applied at nodes to validate these operations).
Figure 3. Graphical representation of atomic operations O1–O3, where the functions attached to the arcs of concatenation and duplication are the identity operation $I$ (omitted for brevity).
O1.
Series connection ($\circ$): We combine $\mathcal{B}$ and $\mathcal{K}$ by letting the output of $k_1 \in \mathcal{K}$ be the input of $k_2 \in \mathcal{B}$, where
$$\mathcal{B} \circ \mathcal{K}: \quad k_2 \circ k_1: \mathbf{x} \mapsto k_2(k_1 \mathbf{x}).$$
O2.
Concatenation: We combine multichannel inputs $\mathbf{x}_i \in \chi$ into a vector as follows:
$$\mathrm{concatenation}: \{ \mathbf{x}_1, \ldots, \mathbf{x}_m \} \mapsto \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_m \end{bmatrix}.$$
O3.
Duplication: We duplicate an input to generate m copies of itself as follows:
$$\mathrm{duplication}: \mathbf{x} \in \chi \mapsto \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix} \mathbf{x}.$$
From the basis set and O1–O3, regulatory rule R generates other elements in $\mathcal{K}$ by regulating the application of atomic operations on $\mathcal{K}$. The aim of R is to obtain DNNs that can be represented as DAGs; therefore, this rule precludes the generation of graphs that contain loops.
R.
DAG closure: We apply O1–O3 to $\mathcal{K}$ in accordance with
$$\mathrm{R}: \mathcal{K} \to \mathcal{K}.$$
The DAGs of $\mathcal{K}$ comprise nodes and arcs, each of which belongs to one of operations O1–O3. Rule R mandates that any member of $\mathcal{K}$ can be represented as a DAG, in which arcs are associated with members of $\mathcal{B}$ and nodes coordinate the inlets and outlets of arcs. The rule pertaining to the retention of DAGs after operations on DAGs is crucial to our analysis, based on the fact that the nodes in a DAG can then be ordered (see Section 4). To achieve a more comprehensive understanding, we use figures to express DAGs. Nevertheless, a formal definition of a DAG must comprise triplets of nodes, arcs, and functions associated with arcs. An arc can be described as $(v_i, k, v_o)$, where $v_i$ and $v_o$ refer to the input and output nodes of the arc, respectively, and $k \in \mathcal{B}$ is the function associated with the arc.
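The sketch below (function names are mine, not the paper's) mimics the axiomatic construction in plain Python: arcs carry basic elements from $\mathcal{B}$, and the three atomic operations are ordinary composition, stacking, and copying. Applying them repeatedly, as rule R permits, builds a small DAG such as a two-channel fusion.

```python
import numpy as np

# Basic elements attached to arcs: an affine map M = (L, b) and a ReLU activation.
def affine(L, b):
    return lambda x: L @ x + b

relu = lambda x: np.maximum(x, 0.0)

def series(k2, k1):                    # O1: x -> k2(k1(x))
    return lambda x: k2(k1(x))

def concatenation(*inputs):            # O2: stack multichannel inputs into one vector
    return np.concatenate(inputs)

def duplication(x, m):                 # O3: m copies of the input
    return [x.copy() for _ in range(m)]

# A tiny DAG: duplicate the input, push each copy through its own channel,
# then fuse the concatenated channel outputs with a linear map (a fusion module).
rng = np.random.default_rng(1)
x = rng.normal(size=4)
k1 = series(relu, affine(rng.normal(size=(3, 4)), rng.normal(size=3)))
k2 = series(relu, affine(rng.normal(size=(3, 4)), rng.normal(size=3)))
x1, x2 = duplication(x, 2)
fused = rng.normal(size=(2, 6)) @ concatenation(k1(x1), k2(x2))
print(fused.shape)                     # (2,)
```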
We provide the following definition for the class of DNNs considered in this paper.
Definition 1.
General DNNs (denoted as $\mathcal{K}$) are DNNs constructed using the axiomatic method involving point-wise CPWL activation functions and non-linear transformations, which, together, satisfy assumption (A).
Note that DNNs comprise hidden layers and an output layer. For the remainder of this paper, we do not consider the output layers in DNNs because adding a layer of continuous output functions does not alter our conclusion.

3.3. Useful Modules

Generally, the construction of a DNN network is based on modules. Below, we illustrate some useful modules in pragmatic applications of general DNNs.
(1) MaxLU module: Pooling is an operation that reduces the dimensionality of an input block. The operation can be linear or non-linear. For example, average pooling is linear (it outputs the average of the block), whereas max pooling is non-linear. Max pooling is usually implemented in conjunction with the ReLU layer (i.e., maxpooling $\circ$ ReLU) [28] to obtain a MaxLU module. The following analysis is based on the MaxLU function of block size 2 using un-rectification, where the MaxLU function of any other block size can be derived recursively from the MaxLU of block size 2 and analyzed in a similar manner [5]. $\mathrm{MaxLU}_2: \mathbb{R}^2 \to \mathbb{R}$ is a CPWL activation function that partitions $\mathbb{R}^2$ into three polygons. By representing the MaxLU layer with un-rectifying, we obtain the following:
$$\mathrm{MaxLU}_2(\mathbf{x} = [x_1\;\, x_2]^\top) = [1\;\, 1]\, D_{\mathbf{x}}^{\sigma}\, \mathbf{x},$$
where $D_{\mathbf{x}}^{\sigma} \in \mathbb{R}^{2 \times 2}$ is a diagonal matrix with entries in $\{0, 1\}$ and
$$\mathrm{diag}(D_{\mathbf{x}}^{\sigma}) = \begin{cases} [1\;\, 0]^\top & \text{when } x_1 \ge x_2 \text{ and } x_1 > 0, \\ [0\;\, 1]^\top & \text{when } x_2 > 0 \text{ and } x_2 > x_1, \\ [0\;\, 0]^\top & \text{otherwise}. \end{cases}$$
Figure 4 compares the domain partitions of max pooling, ReLU, and $\mathrm{MaxLU}_2$.
Figure 4. Comparisons of max pooling, ReLU, and $\mathrm{MaxLU}_2$ partitioning of $\mathbb{R}^2$, where the vectors in (b,c) are the diagonal vectors of the un-rectifying matrices of the ReLU and MaxLU layers and dashed lines denote open region boundaries: (a) $\max_2$ partitions $\mathbb{R}^2$ into two regions; (b) ReLU partitions $\mathbb{R}^2$ into four polygons; (c) $\mathrm{MaxLU}_2$ partitions $\mathbb{R}^2$ into three polygons. Note that the region boundary in the third quadrant of (a) is removed in (c).
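The following sketch (my own, not the authors' code) evaluates $\mathrm{MaxLU}_2$ through its un-rectifying selection matrix and confirms that it coincides with max pooling composed with ReLU.

```python
import numpy as np

def maxlu2_unrectified(x):
    """MaxLU_2 on a block of size 2 via its un-rectifying matrix D_x."""
    x1, x2 = x
    if x1 >= x2 and x1 > 0:
        d = np.array([1.0, 0.0])
    elif x2 > 0 and x2 > x1:
        d = np.array([0.0, 1.0])
    else:                               # both entries are non-positive
        d = np.array([0.0, 0.0])
    return np.ones(2) @ (np.diag(d) @ x)

rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.normal(size=2)
    assert np.isclose(maxlu2_unrectified(x), np.max(np.maximum(x, 0.0)))  # maxpool(ReLU(x))
```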
(2) Series module: This module is a composition of $\mathcal{K}$ and $\mathcal{K}$ (denoted as $\mathcal{K} \circ \mathcal{K}$), which differs from operation O1 ($\mathcal{B} \circ \mathcal{K}$).
The series-connected network $N_L^s(\mathbf{x}) = \varrho_L M_L \cdots \varrho_1 M_1(\mathbf{x})$ is derived using a sequence of series modules. We consider the case in which each $\varrho_i$ is a ReLU. The theoretical underpinnings of this module involving MaxLU activation functions can be found in [5]. We first present an illustrative example of $N_2^s$, then extend our analysis to $N_L^s$. Note that input space $\chi$ is partitioned into a finite number of regions using $(\mathrm{ReLU}_1)M_1$, where $M_1$ is an affine mapping. The partition is denoted as $\mathcal{P}_1$. The composition of $(\mathrm{ReLU}_2)M_2$ and $(\mathrm{ReLU}_1)M_1$ (i.e., $\mathrm{ReLU}_2 M_2 \mathrm{ReLU}_1 M_1$) refines $\mathcal{P}_1$ such that the resulting partition can be denoted as $\mathcal{P}_2$. Figure 5 presents a tree partition of $\mathbb{R}^2$ using $(\mathrm{ReLU}_2)M_2(\mathrm{ReLU}_1): \mathbb{R}^2 \to \mathbb{R}^1$, where $M_2\mathbf{x} = w_1 x_1 + w_2 x_2 + b$, in which $\mathbf{x} = [x_1\;\, x_2]^\top$, $w_1, w_2 \ne 0$, and $b \ne 0$. The affine linear functions over $\mathcal{P}_2$ can be expressed as $D_2 M_2 D_1$ on $\mathbb{R}^2$, where $D_1$ and $D_2$ indicate the un-rectifying matrices of $\mathrm{ReLU}_1$ and $\mathrm{ReLU}_2$ with $\mathbf{x}$ and $M_2\,\mathrm{ReLU}_1\mathbf{x}$ as inputs, respectively.
Figure 5. (Top-left) $\mathbb{R}^2$ partitioned using $\mathrm{ReLU}_1$ and (bottom-left) refined using $\mathrm{ReLU}_2 M_2$. (Right) Tree partition on $\mathbb{R}^2$ from the composition of ReLUs, where the function with domain over a region is indicated beneath the leaf and regions are named according to vectors obtained by stacking the diagonal elements of the un-rectifying matrix of $\mathrm{ReLU}_1$ over those of the un-rectifying matrix of $\mathrm{ReLU}_2$.
For a series-connected network $N_i^s$, we let $(\mathcal{P}_i, \mathcal{A}_i)$ denote the partition and corresponding functions. The relationship between $(\mathcal{P}_L, \mathcal{A}_L)$ and $(\mathcal{P}_{L-1}, \mathcal{A}_{L-1})$ is presented as follows.
Lemma 2
([5]). Let $(\mathcal{P}_i = \{P_{i,k}\}, \mathcal{A}_i = \{A_{i,k}\})$ denote the partition of input space $\chi$ and the collection of affine linear functions of $N_i^s$. Furthermore, let the domain of the affine linear function $A_{i,k}$ be $P_{i,k}$. Then,
(i) 
$\mathcal{P}_L$ refines $\mathcal{P}_{L-1}$ (any partition region in $\mathcal{P}_L$ is contained in one and only one partition region in $\mathcal{P}_{L-1}$).
(ii) 
The affine linear mappings of $N_L^s = \mathrm{ReLU}_L M_L N_{L-1}^s$ can be expressed as $\mathcal{A}_L|_{\chi} = D_L M_L \mathcal{A}_{L-1}|_{\chi}$, where $D_L$ is an un-rectifying matrix of $\mathrm{ReLU}_L$. This means that if $A_{L,i} \in \mathcal{A}_L$, then there must be a $j$ with $P_{L,i} \subseteq P_{L-1,j}$ such that $A_{L,i}|_{P_{L,i}} = D_{L,i} M_L A_{L-1,j}|_{P_{L,i}}$, where the un-rectifying matrix $D_{L,i}$ depends on $M_L A_{L-1,j}|_{P_{L,i}}$.
$\mathcal{P}_L$ refines $\mathcal{P}_{L-1}$ (i.e., $\mathcal{P}_L \preceq \mathcal{P}_{L-1}$), which means that if a directed graph is based on the partition refinement in the leftmost subfigure of Figure 6, then the node corresponding to $\mathcal{P}_L$ is the only child node of node $\mathcal{P}_{L-1}$, as well as the only parent node of node $\mathcal{P}_{L+1}$. The graph of this module is a tree with precisely one child at each node.
Figure 6. Graph based on partition refinement in which nodes denote partitions and an arc $a \to b$ denotes that the partition of node $b$ is a refinement of the partition of node $a$. (Left) Graph of a serial module; (middle) graph of a parallel module; (right) graph of a fusion module.
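The partition refinement described by Lemma 2 can be observed numerically. The sketch below (an illustration of mine with arbitrary random weights) labels sampled 2-D points by their ReLU activation patterns after one and after two layers; because each level-2 pattern extends a level-1 pattern, the second partition refines the first.

```python
import numpy as np

rng = np.random.default_rng(3)
M1, b1 = rng.normal(size=(5, 2)), rng.normal(size=5)
M2, b2 = rng.normal(size=(5, 5)), rng.normal(size=5)

X = rng.uniform(-3, 3, size=(20000, 2))
h1 = X @ M1.T + b1
p1 = h1 > 0                                   # level-1 activation pattern per point
h2 = np.maximum(h1, 0.0) @ M2.T + b2
p2 = np.hstack([p1, h2 > 0])                  # level-2 pattern extends the level-1 pattern

regions1 = {tuple(row) for row in p1}
regions2 = {tuple(row) for row in p2}
print(len(regions1), len(regions2))
assert len(regions2) >= len(regions1)         # P_2 refines P_1 on the sampled points
```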
(3) Parallel module (Figure 7a): This module is a composition that applies an element in $\mathcal{K}$ to each output of the duplication operation, denoted as follows:
$$(\mathrm{parallel}) := \mathcal{K} \circ (\mathrm{duplication}).$$
Figure 7. Graphical representations of (a) a parallel module (which becomes a duplication operation when $k_i = I$) and (b) a fusion module (which becomes a concatenation operation when $L = I$ and $k_i = I$).
When expressed in matrix form, we obtain the following:
$$\mathbf{x} \mapsto \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix} \mathbf{x} \mapsto \begin{bmatrix} k_1 \mathbf{x} \\ \vdots \\ k_m \mathbf{x} \end{bmatrix}. \qquad (22)$$
In the literature on DNNs, this module is also referred to as a multifilter ($k_i$ is typically a filtering operation followed by an activation function) or multichannel module. If partition $\mathcal{P}$ is associated with the input (i.e., $\mathbf{x}$) in (22) and partition $\mathcal{P}_i$ is associated with the output of the i-th channel (i.e., $k_i \mathbf{x}$), then $\mathcal{P}_i$ is a refinement of $\mathcal{P}$ in accordance with (2). As shown in the middle subgraph of Figure 6, the node corresponding to $\mathcal{P}_i$ is a child node of the node corresponding to $\mathcal{P}$. This graph is a tree in which the node of $\mathcal{P}$ has m child nodes corresponding to partitions $\mathcal{P}_1, \ldots, \mathcal{P}_m$.
(4) Fusion module (Figure 7b): This type of module can be used to combine different parts of a DNN and uses linear operations to fuse inputs. The module is denoted as follows:
$$(\mathrm{fusion}) := L \circ (\mathrm{concatenation}).$$
In matrix form, we obtain the following:
$$\{ k_1 \mathbf{x}_1, \ldots, k_m \mathbf{x}_m \} \mapsto L \begin{bmatrix} k_1 \mathbf{x}_1 \\ \vdots \\ k_m \mathbf{x}_m \end{bmatrix}.$$
Note that a non-linear fusion can be obtained by applying a composition of $\rho M$/$\rho$ to the fusion module in which $L = I$.
We denote the domain partition of $\chi$ associated with the i-th channel as $\mathcal{P}_i = \{ R_i^{j_i} \mid j_i = 1, \ldots, n_i \}$, where $n_i$ is the number of partition regions. Partition $\mathcal{P}$ of $\chi$ generated by the fusion module can be expressed as the union of non-empty intersections of partition regions in $\mathcal{P}_i$ for all i, as follows:
$$\mathcal{P} \triangleq \mathcal{P}_1 \wedge \mathcal{P}_2 \wedge \cdots \wedge \mathcal{P}_m = \bigcup_{j_1, \ldots, j_m} \big( R_1^{j_1} \cap R_2^{j_2} \cap \cdots \cap R_m^{j_m} \big). \qquad (24)$$
Any partition region in $\mathcal{P}$ is contained in precisely one region in any $\mathcal{P}_i$. In other words, $\mathcal{P}$ is a refinement of $\mathcal{P}_i$ for $i = 1, \ldots, m$. An obvious bound on the number of partition regions of $\mathcal{P}$ is $\prod_{i=1}^{m} n_i$. We let $\{ A_i^{j_i} \mid j_i = 1, \ldots, n_i \}$ denote the affine mappings associated with the i-th channel and $A_i^{j_i}$ be the affine mapping with the domain restricted to partition region $R_i^{j_i}$. The affine mapping of the fusion module over partition region $R_1^{j_1} \cap R_2^{j_2} \cap \cdots \cap R_m^{j_m}$ is derived as follows:
$$R_1^{j_1} \cap R_2^{j_2} \cap \cdots \cap R_m^{j_m} \mapsto L \begin{bmatrix} A_1^{j_1} \\ \vdots \\ A_m^{j_m} \end{bmatrix}. \qquad (25)$$
For the sake of convenience, (24) and (25) are summarized in the following lemma.
Lemma 3.
Suppose that a fusion module comprises m channels. Let $\mathcal{P}_i$ denote the partition associated with the i-th channel of the module and let $\mathcal{P}$ denote the partition of the fusion module. Then, $\mathcal{P}$ is a refinement of $\mathcal{P}_i$ for $i = 1, \ldots, m$. Moreover, let $\mathcal{A}$ denote the collection of affine linear mappings over $\mathcal{P}$. Thus, the affine linear mapping over a partition region of $\mathcal{P}$ can be obtained in accordance with (25).
As shown in the rightmost subfigure of Figure 6, $\mathcal{P}$ is a refinement of any $\mathcal{P}_i$, which means that the node corresponding to partition $\mathcal{P}$ is a child node of the node corresponding to partition $\mathcal{P}_i$. The fact that the node associated with partition $\mathcal{P}$ has more than one parent node means that the graph for this module is not a tree.
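The common-refinement behavior of the fusion module can be illustrated as follows (a sketch of mine with two arbitrary random ReLU channels): each sampled input receives a region label per channel, and a fusion region corresponds to a non-empty combination of channel labels.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(20000, 2))

def region_labels(X, M, b):
    """Label each point by the ReLU activation pattern of rho(Mx + b)."""
    return [tuple(row) for row in (X @ M.T + b > 0)]

lab1 = region_labels(X, rng.normal(size=(2, 2)), rng.normal(size=2))   # channel 1
lab2 = region_labels(X, rng.normal(size=(2, 2)), rng.normal(size=2))   # channel 2
fused = list(zip(lab1, lab2))          # a fusion region is a non-empty intersection

n1, n2, nf = len(set(lab1)), len(set(lab2)), len(set(fused))
print(n1, n2, nf)
assert max(n1, n2) <= nf <= n1 * n2    # refinement of both, bounded by the product
```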
Example 2.
Figure 8 illustrates the fusion of two channels, as follows:
$$M: \mathbb{R}^2 \ni \mathbf{x} \mapsto L \begin{bmatrix} \rho_1 M_1 \mathbf{x} \\ \rho_2 M_2 \mathbf{x} \end{bmatrix},$$
where $M_1, M_2: \mathbb{R}^2 \to \mathbb{R}^2$, and $\rho_1$ and $\rho_2$ are ReLUs. The partition induced by $M$ comprises eight polytopes, each of which is associated with an affine linear mapping.
Figure 8. Fusion of two channels in which each channel partitions $\mathbb{R}^2$ into four regions: (a) $\mathcal{P}_1$ is the partition due to $\rho_1 M_1$, where $M_1: \mathbb{R}^2 \to \mathbb{R}^2$; (b) $\mathcal{P}_2$ is the partition due to $\rho_2 M_2$, where $M_2: \mathbb{R}^2 \to \mathbb{R}^2$; (c) $\mathcal{P}$ is a refinement of $\mathcal{P}_1$ and $\mathcal{P}_2$.
Figure 9 illustrates the refinement of partitions in the network in Figure 9a, which involves a series connection of five fusion layers. Each fusion layer includes a fusion module derived by a concatenation of inputs ($\mathrm{ReLU}\, M_1$ (top) and $\mathrm{ReLU}\, M_2$ (bottom)). The result of the concatenation is subsequently input into the linear function $L = [I\;\, I]$ (fusion). The curves for the top, bottom, and fusion channels in Figure 9b–d are consistent with the assertion of Lemma 3, wherein the partitions of the top and bottom channels are refined by the fusion channel. Because the fusion channel is a refinement of the top and bottom channels, the maximum number of vectors in a given partition region and the maximum distance between pairs of points in the same partition region are smaller than in the top and bottom channels, as shown in Figure 9b,c. On the other hand, the number of partition regions in the fusion channel is larger than in the top and bottom channels, as shown in Figure 8 and Figure 9d.
Figure 9. Simulation of partition refinement using fusion modules with 50,000 random points ($14 \times 1$) as inputs with entries sampled independently and identically distributed (i.i.d.) from the standard normal distribution. (a) Network comprising five layers of fusion modules, each of which comprises three channels (top, bottom, and fusion). The dimensions of the weight matrix and bias of $M_i$ in the top and bottom channels are $14 \times 14$ and $14 \times 1$, respectively, with coefficients in $M_i$ sampled i.i.d. from the standard normal distribution and $L = [I\;\, I]$. (b) The number of points in partition regions containing at least two elements versus fusion layers. (c) The maximum distance between pairs of points located in the same partition region versus fusion layers. (d) The number of partition regions versus fusion layers. The fact that in (b,c) the curves corresponding to the fusion channel are beneath those of the other channels and that in (d) the curve of the fusion channel is above those of the other channels is consistent with the analysis that the partitions at the fusion channel are finer than those at the bottom and top channels.
(5) The following DNNs are derived by applying the DAG closure rule (R) to the modules.
Example 3.
As shown in Figure 10, the ResNet module [11] comprises ReLU and a fusion module:
$$(\mathrm{ResNet}) := \rho \circ (\mathrm{fusion}).$$
Figure 10. ResNet module featuring a direct link and the same domain partitions at points a and b. In DenseNet, the addition node is replaced with a concatenation node.
Using matrix notation, we obtain the following:
$$\mathbf{x} \mapsto \rho\, [I\;\, I]\, \mathrm{diag}(I, M_2 \rho M_1) \begin{bmatrix} I \\ I \end{bmatrix} \mathbf{x} = \rho (I + M_2 \rho M_1)\, \mathbf{x},$$
where $M_1$ and $M_2$ are affine mappings. The unique feature of ResNet is the direct link, which enhances resistance to the gradient vanishing problem in back-propagation algorithms [29]. Direct linking and batch normalization [30,31] have become indispensable elements in the learning of very deep neural networks using back-propagation algorithms.
Let $M_L$ denote an L-layer DNN. A residual network [11] extends $M_L$ from L layers to L + 2 layers as follows: $(\mathrm{ResNet}) \circ M_L$. Repetition of this extension allows a residual network to attain an arbitrary number of layers. As noted in the caption of Figure 10, domain partitioning is the same at a and b. This can be derived in accordance with the following analysis. Let $\mathcal{P}_0$ denote the domain partitioning of $\chi$ at the input of the module. The top channel of the parallel module retains the partition, whereas in the bottom channel, the partition is refined as $\mathcal{P}_1$ using $M_2 \rho M_1$. In accordance with (24), the domain partition of the fusion function is $\mathcal{P}_1 \wedge \mathcal{P}_0$ (i.e., $\mathcal{P}_1$). Thus, the domain partitions at a and b are equivalent. Note that the DenseNet module [32] replaces the addition node in Figure 10 with a concatenation node. The partitions of DenseNet at b and a are the same, as in the ResNet case.
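A minimal forward-pass sketch of the module in the un-rectified form $\rho(I + M_2\rho M_1)\mathbf{x}$ (my own illustration; the matrices are random placeholders, not trained weights):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def resnet_module(x, W1, b1, W2, b2):
    """rho((I + M2 rho M1) x): the direct link plus the transformed branch, then ReLU."""
    branch = W2 @ relu(W1 @ x + b1) + b2     # bottom channel: M2 rho M1
    return relu(x + branch)                  # fusion by addition, then activation

rng = np.random.default_rng(5)
n = 6
x = rng.normal(size=n)
y = resnet_module(x, rng.normal(size=(n, n)), rng.normal(size=n),
                  rng.normal(size=(n, n)), rng.normal(size=n))
print(y.shape)                               # (6,)
```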
Example 4.
Transformers are used to deal with sequence-to-sequence conversions, wherein the derivation of long-term correlations between tokens in a sequence is based on the attention module [33,34]. A schematic illustration of a self-attention module is presented in Figure 11a, where the input vectors are $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$, and $\mathbf{a}_4$ and the output vectors are $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$, and $\mathbf{b}_4$. The query, key, and value vectors for $\mathbf{a}_i$ are generated from matrices $W^q$, $W^k$, and $W^v$, respectively, where $\mathbf{q}_i = W^q \mathbf{a}_i$, $\mathbf{k}_i = W^k \mathbf{a}_i$, and $\mathbf{v}_i = W^v \mathbf{a}_i$ for all i. Attention score $\alpha_{i,j}$ indicates the inner product between the normalized vectors of $\mathbf{q}_i$ and $\mathbf{k}_j$. The vector of attention scores $[\alpha_{i,1}, \alpha_{i,2}, \alpha_{i,3}, \alpha_{i,4}]$ is input into the soft-max layer to obtain probability distribution $[\hat\alpha_{i,1}, \hat\alpha_{i,2}, \hat\alpha_{i,3}, \hat\alpha_{i,4}]$, where $\hat\alpha_{i,j} = \frac{e^{\alpha_{i,j}}}{\sum_j e^{\alpha_{i,j}}}$. Output vector $\mathbf{b}_i = \sum_j \hat\alpha_{i,j} \mathbf{v}_j$ is derived via multiplications and additions as a linear combination of value vectors with coefficients derived from the probability distribution. Figure 11b presents a graphical representation of (a), wherein non-linear transformation $\sigma$ is the soft-max function. Dictionary $V$ of value vectors can be obtained by performing a concatenation operation, which implicitly involves reshaping the dimensions of the resulting vector to a matrix (see the dashed box in the figure).
Figure 11. Self-attention module: (a) network; (b) graph in which the highlighted dashed box represents the graph used to obtain dictionary $V$ from the value vectors. Different colors correspond to different inputs $\mathbf{a}_i$.
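A compact single-head self-attention sketch following the computation described above (my own illustration; the matrices are random placeholders, and the score normalization by $\sqrt{d}$ is an assumption in the spirit of [33]):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def self_attention(A, Wq, Wk, Wv):
    """A holds the inputs a_i as rows; the outputs b_i are returned as rows."""
    Q, K, V = A @ Wq.T, A @ Wk.T, A @ Wv.T
    B = np.zeros_like(V)
    for i in range(A.shape[0]):
        scores = K @ Q[i] / np.sqrt(Q.shape[1])   # alpha_{i,j}: normalized inner products
        alpha_hat = softmax(scores)               # probability distribution over tokens
        B[i] = alpha_hat @ V                      # b_i = sum_j alpha_hat_{i,j} v_j
    return B

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 8))                       # a_1, ..., a_4 as rows
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(A, Wq, Wk, Wv).shape)        # (4, 8)
```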
Example 5.
Figure 12 illustrates the well-known LeNet-5 network [35]. Figure 12a presents a block diagram of the network in which the input is an image ($28 \times 28$ px) and the output is one of ten classes of characters (0 to 9). CNN block number 1 is a convolution layer with the following parameters: filter size, $5 \times 5$; stride, 1; padding, 2; and six-channel output, six $28 \times 28$ images. Block numbers 2 and 3 indicate MaxLU operations with a six-channel output (six $14 \times 14$ images). Black circle 4 indicates a concatenation operation (O2), the output of which is an image. Block number 5 indicates a convolution layer with the following parameters: filter size, $5 \times 5$; stride, 1; and padding, 0. This layer outputs sixteen $10 \times 10$ images. Block numbers 6 and 7 indicate MaxLU operations, the output of which is sixteen $5 \times 5$ images. Black circle 8 indicates a concatenation operation (O2), the output of which is a vector. Block number 9 indicates a fully connected network with an input dimension of 400 and an output dimension of 120. Block number 10 indicates a ReLU activation function. Block number 11 is a fully connected network with an input dimension of 120 and an output dimension of 84. Block number 12 indicates a ReLU activation function. Block number 13 indicates a fully connected network with an input dimension of 84, where the output is a prediction of one of the 10 classes. Figure 12b presents a graphical representation of (a), and (c) presents a simplified graphical representation of (a). In Figure 12c, we can see that LeNet-5 begins with a sequence of compositions of modules (featuring a parallel module followed by a fusion module), then a sequence of MaxLU layers. Here, we apply the LeNet-5 network to the MNIST dataset (though the approach is not limited to this dataset) with 60,000 images. This is a simple implementation illustrating the proposed approach. Figure 13 illustrates the properties of the partitions at the outputs of levels 3, 4, 7, and 8 in Figure 12b. The curves in Figure 13a–c are consistent with the assertion of Lemma 3, which indicates that the partitions of the previous channels are refined by the fusion channel. The results are similar to those in Figure 9.
Figure 12. LeNet-5: (a) network; (b) graph in which the numbers beneath the nodes and arcs correspond to block numbers in (a) and solid nodes indicate fusion module/concatenation; (c) simplification of (b) illustrating composition of modules, where I and O denote input and output, respectively.
Figure 13. Illustrations of partition refinements in LeNet-5 at output levels 3, 4, 7, and 8 (as shown in Figure 12b), where the input is 60,000 images from the MNIST dataset (note that levels 3 and 7 are parallel connections linking several channels; therefore, only the maximum values of all channels at those levels are plotted): (a) maximum number of images in a given partition region; (b) maximum distance between pairs of images located in the same partition region (distance at level 8 is zero, which means that each partitioning region contains no more than one image); (c) maximum number of partition regions in a channel (note that as the level increases, the curves in (a,b) decrease, while the curve in (c) increases, which is consistent with our assertion that the partitions of the previous channels are refined by the fusion channel).
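A hedged PyTorch-style sketch of the block parameters listed above (my own reconstruction, not the authors' code; the concatenation nodes 4 and 8 are implicit in the tensor layout and the Flatten step):

```python
import torch
import torch.nn as nn

# MaxLU = max pooling composed with ReLU (maxpooling o ReLU), as described in the text.
def maxlu(kernel=2):
    return nn.Sequential(nn.ReLU(), nn.MaxPool2d(kernel))

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),   # block 1: 6 x 28 x 28
    maxlu(),                                               # blocks 2-3: 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),  # block 5: 16 x 10 x 10
    maxlu(),                                               # blocks 6-7: 16 x 5 x 5
    nn.Flatten(),                                          # block 8: vector of length 400
    nn.Linear(400, 120), nn.ReLU(),                        # blocks 9-10
    nn.Linear(120, 84), nn.ReLU(),                         # blocks 11-12
    nn.Linear(84, 10),                                     # block 13: 10 class scores
)

x = torch.randn(1, 1, 28, 28)
print(lenet5(x).shape)                                     # torch.Size([1, 10])
```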
Remark 1.
Graphs corresponding to partitions generated by series and parallel modules are trees. Accordingly, the partitions created by a network comprising only series and parallel modules also form a tree. According to [36], trees learned for regression or classification purposes suffer from overfitting due to the bias–variance tradeoff dilemma. However, in [37,38], it was reported that this tradeoff is not an overriding concern (referred to as benign overfitting) when learning a deep neural network in which the number of parameters exceeds the training data by a sufficient margin. This suggests that benign overfitting is relevant to the fusion module, considering that the graph for the module is not a tree.

4. Properties of General DNNs

In accordance with the axiomatic approach, we defined general DNNs (i.e., $\mathcal{K}$) as those that can be represented using DAGs. In the following, we outline the properties belonging to all members of the class.

4.1. Function Approximation via Partition Refinement

We define the computable subgraph of node a as the subgraph of the DAG containing precisely all the paths from the input node of the DAG to node a. Clearly, the computable subgraph defines a general DNN, the output node of which is a, such that it computes a CPWL function. In Figure 14a, the computable subgraph of node a (highlighted in light blue) contains node numbers 0, 1, 2, 3, and 4. The computable subgraph of node c (highlighted in light green) is contained in the subgraph of a.
Figure 14. DAG representation of a DNN illustrating the computable subgraphs and levels of nodes (solid nodes denote fusion/concatenation nodes, and I and O denote input and output, respectively). (a) Computable subgraph associated with each node. Light blue denotes the computable subgraph of node a, wherein the longest path from the input node to node a contains four arcs (i.e., $l(a) = 4$). Light green denotes the computable subgraph of c, wherein the longest path from the input node to node c contains one arc (i.e., $l(c) = 1$). (b) Nodes in (a) are partially ordered in accordance with levels (e.g., level 1 has four nodes, and level 2 has one node).
In the following, we outline the domain refinement of a general DNN. Specifically, if node b is contained in a computable subgraph of a, then the domain partition imposed by that subgraph is a refinement of the partition imposed by the computable subgraph of b.
Theorem 1.
The domain partitions imposed by computable subgraphs $G_a$ (at node a) and $G_b$ (at node b) of a general DNN are denoted as $\mathcal{P}_a$ and $\mathcal{P}_b$, respectively. Suppose that node b is contained in subgraph $G_a$. Then, $\mathcal{P}_a$ refines $\mathcal{P}_b$.
Proof. 
Without a loss of generality, we suppose that arcs in the general DNN are atomic operations and that the functions applied to arcs are included in base set $\mathcal{B}$. Furthermore, we suppose that $p$ is a path from the input node to node a, which also passes through node b. $p'$ denotes the subpath of $p$ from node b to node a ($p': b = c_0 \to c_1 \to \cdots \to c_n = a$). $\mathcal{P}_{c_i}$ denotes the partition of input space $\chi$ defined using the computable subgraph at node $c_i$. In the following, we demonstrate that if $c_i \to c_{i+1}$, then $\mathcal{P}_{c_i}$ is refined by $\mathcal{P}_{c_{i+1}}$. Note that arc $c_i \to c_{i+1}$ belongs to one of the three atomic operations. If it is a series-connection operation (O1), then the refinement can be obtained by referring to Lemma 2. If it is a concatenation operation (O2), then the refinement is obtained by referring to Lemma 3. If it is a duplication operation (O3), then the partitions for nodes $c_i$ and $c_{i+1}$ are the same. Thus, $\mathcal{P}_a$ is a refinement of $\mathcal{P}_b$. □
Figure 14a presents the DAG representation of a DNN. Node b is contained in the computable subgraph of node a, whereas node c is contained in the computable subgraph of b, such that the domain partition of a is a refinement of the partition of b, and the domain partition of b is a refinement of the partition of c. Thus, the domain partition of node a is a refinement of the domain partition of node c.
As hinted in Theorem 1, general DNNs are implemented using a data-driven “divide and conquer” strategy when performing function approximation. In other words, when traveling a path from the input node to a node of the DNN, we can envision the progressive refinement of the input space partition along the path, where each partition region is associated with an affine linear mapping. Thus, computing a function using a general DNN is equivalent to approximating the function using local simple mappings over regions in a partition derived using the DNN. A finer approximation of the function can be obtained by increasing the lengths of paths from the input node to the output node. Figure 15 illustrates the conventional and general DNN approaches to the problem of linear regression. The conventional approach involves fitting “all” of the data to obtain a dashed hyperplane, whereas the general DNN approach involves dividing the input space into two parts and fitting each part using hyperplanes.
Figure 15. Approaches to linear regression. The dashed line indicates the regression derived from all data (i.e., the conventional approach), and the solid lines indicate regressions derived from the data with $x > a$ and the data with $x \le a$, continuous at $x = a$ (i.e., the general DNN approach).
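The contrast can be reproduced with a few lines of least squares (an illustration of mine; the target function, noise level, and breakpoint $a$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
a = 0.0                                         # illustrative breakpoint
x = rng.uniform(-2, 2, size=400)
y = np.where(x > a, 2.0 * x, -0.5 * x) + 0.05 * rng.normal(size=x.size)

def fit_line(xs, ys):
    A = np.column_stack([xs, np.ones_like(xs)])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef                                 # slope, intercept

def mse(coef, xs, ys):
    return np.mean((coef[0] * xs + coef[1] - ys) ** 2)

global_fit = fit_line(x, y)                     # conventional approach: one line for all data
left = x <= a
left_fit, right_fit = fit_line(x[left], y[left]), fit_line(x[~left], y[~left])

print(mse(global_fit, x, y))                                               # large: one line cannot fit the kink
print(mse(left_fit, x[left], y[left]), mse(right_fit, x[~left], y[~left])) # both small
```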

4.2. Stability via Sparse/Compressible Weight Coefficients

We introduce the level function $l$ of node a to denote the number of arcs along the longest path from input node I to node a of a DAG, where $l(a)$ is referred to as the level of node a. According to this definition, the level of the input node is zero.
Lemma 4.
The level is a continuous integer across the nodes in a general DNN. In other words, if node a is not the output node, then there must exist a node b where $l(b) = l(a) + 1$.
Proof. 
Clearly, $l(I) = 0$ for input node I. This lemma can be proven via contradiction. Suppose that the levels are non-continuous integers. Without a loss of generality, the nodes can be divided into two groups (A and B), where A includes all of the nodes with levels $\le n$ and B includes all of the nodes with levels $\ge n + k$, with $k > 1$. Let $b \in B$ be assigned the lowest level in group B, with $l(b) = n + k$. Furthermore, $p(b)$ denotes the longest path from the input node to node b, and a is a node along the path with arc $a \to b$. Thus, $l(a) < n + k$; otherwise, $l(b) \ge n + k + 1$, which violates the assumption that $l(b) = n + k$. If $l(a) \le n$, then $l(b) \le n + 1$ (since a is on the longest path $p(b)$ to b and a has a direct link to b). This violates the assumption that $l(b) = n + k$ with $k > 1$. Thus, $n < l(a) < n + k$, according to which $a \notin A$ and $a \notin B$. This violates the assumption that all nodes can be divided into the two groups (A and B). We obtain a contradiction and, hence, complete the proof. □
The nodes in a DAG can be ordered in accordance with their levels. Assume that there is only one output node, denoted as O. Clearly, $l(O) = L$ is the highest level associated with that DAG. The above lemma implies that the nodes can be partitioned into levels from 0 to L. We introduce the notation $\bar l(n)$ (referring to the nodes at level n) to denote the collection of nodes with levels equal to n and let $|\bar l(n)|$ denote the number of nodes at that level. As shown in Figure 14b, the nodes in Figure 14a are ordered in accordance with their levels, as indicated by the numbers beside the nodes. For any DNN $N_L \in \mathcal{K}$, we can define the DNN function $N_n$ (with $n \le L$) by stacking the DNN functions of the nodes at level n into a vector as follows:
$$N_n = [N_{(a)}]_{a \in \bar l(n)}, \qquad (27)$$
where $N_{(a)}$ is the function derived using the computable subgraph of node $a \in \bar l(n)$. Clearly, $N_n \in \mathcal{K}$ because it is formed by the concatenation of the $N_{(a)}$. For example, in Figure 14b, $N_3 = [N_{(b)}]$ and $N_5 = \begin{bmatrix} N_{(s)} \\ N_{(t)} \end{bmatrix}$. The order of the components in $N_n$ is irrelevant to the subsequent analysis of stability conditions.
The stability of a DNN can be measured as the output perturbation against the input perturbation such that
$$\|N_L(\mathbf{x}) - N_L(\mathbf{y})\|_2 \le C(L)\, \|\mathbf{x} - \mathbf{y}\|_2, \qquad (28)$$
where L is the level of the output node of DNN $N_L$. A stable deep architecture implies that a deep forward inference is well-posed and robust in noisy environments. A sufficient condition for the stability of a DNN requires that $C(L)$ be a bounded, non-increasing function of L as $L \to \infty$.
Lemma 5.
Let d be the uniform bound defined in (21), $I_a = \{ b \mid b \to a \}$ (the nodes directly linking to node a), and $|I_a|$ denote the number of nodes in $I_a$. Further, let $N_L \in \mathcal{K}$ and let $N_{L,p}$ denote the restriction of $N_L$ over partition region $p$. Suppose that $N_{(a)}$ and $N_{(a),p}$ denote the CPWL function associated with the computable subgraph of node a and the restriction of the function over $p$, respectively. As defined in (27), $N_n$ denotes the function derived by the nodes at level $n \le L$, and $N_{n,p}$ denotes the restriction of $N_n$ on domain partition region $p$; i.e.,
$$N_{n,p} = [N_{(a),p}]_{a \in \bar l(n)}. \qquad (29)$$
(i) For a given n and $p$, there exists $C(n, p)$ such that for any $\mathbf{x}, \mathbf{y} \in p$,
$$\|N_{n,p}(\mathbf{x}) - N_{n,p}(\mathbf{y})\|_2 \le C(n, p)\, \|\mathbf{x} - \mathbf{y}\|_2, \qquad (30)$$
where C ( n , p ) is referred to as the Lipschitz constant in p at level n.
(ii) If there exists a level m such that for $n \ge m$,
$$d \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|W_{ab}\|_2 \le 1, \qquad (31)$$
where $W_{ab}$ is the weight matrix associated with the atomic operation on the arc from node b to node a, then the Lipschitz constant $C(n, p)$ is a bounded function of n on $p$.
Proof. 
See Appendix A for the proof. □
This lemma establishes local stability in a partition region of a DNN. To achieve global stability in the input space, we invoke the lemma in [5], which indicates that piece-wise functions that maintain local stability have global stability, provided that the functions are piece-wise continuous.
Lemma 6
([5]). Let $\chi = \bigcup_i P_i$ be a partition and let $f_i$ with domain $\overline{P_i}$ be $l_i$-Lipschitz continuous with $f_i(\mathbf{x}) = f_j(\mathbf{x})$ for $\mathbf{x} \in \overline{P_i} \cap \overline{P_j}$. Let f be defined by $f(\mathbf{x}) := f_i(\mathbf{x})$ for $\mathbf{x} \in P_i$. Then, f is $(\max_i l_i)$-Lipschitz continuous.
Theorem 2.
We adopt the assumptions of Lemma 5. Let DNN $N_L \in \mathcal{K}$ have its domain in input space $\chi$ and let $N_n$ denote the function comprising the nodes of $N_L$ up to level $n \le L$. If there exists a level m such that for $n \ge m$,
$$d \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|W_{ab}\|_2 \le 1, \qquad (32)$$
where d is the uniform bound defined in (21) and $W_{ab}$ is the weight matrix associated with the arc from node b to node a, then, for any partitioning region $p$ of $\chi$, $C(L) = \max_{p} C(L, p)$ is a bounded, non-increasing function of L on $\chi$. Note that $C(L)$ is defined in (28). Hence, $N_L$ is a stable architecture.
Proof. 
See Appendix B for the proof. □
This theorem extends the stability of the series-connected DNNs in [5] to general DNNs. Because $\|W\|_2 \le \|W\|_F$, the condition determining the stability of a DNN can also be expressed as follows:
$$d \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|W_{ab}\|_F \le 1. \qquad (33)$$
This condition holds regardless of the size of $W_{ab}$. Thus, if the matrix is large, then (33) implies that the weight coefficients are sparse/compressible. Note that (32) is a sufficient condition for a DNN to achieve stability; however, it is not a necessary condition. The example in Figure 16 demonstrates that satisfying (32) is not necessary.
Figure 16. Conditions for stability of the ResNet module $I + W_2(\mathrm{ReLU})W_1$, where $I_a = \{b, c\}$: If (32) is satisfied, then $\|W_2\|_2 = 0$ (hence, $W_2 = 0$) due to the fact that $1 + \|W_2\|_2 \le 1$ if and only if $\|W_2\|_2 = 0$. In fact, it is sufficient for the model to achieve stability if $W_2 \ne 0$ with $\|I + W_2 W_1\|_2 \le 1$ (all eigenvalues of $I + W_2 W_1$ lie within $[0, 1]$).
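A small sketch of how the sufficient condition can be checked and enforced at one level (my own illustration; the grouping of the matrices $W_{ab}$ incident to a level is a placeholder, and $d = 1$ is the ReLU bound):

```python
import numpy as np

def level_sum(weights, d=1.0):
    """d times the sum of spectral norms of the weight matrices incident to one level."""
    return d * sum(np.linalg.norm(W, 2) for W in weights)

def rescale_to_satisfy(weights, d=1.0):
    """Uniformly shrink the matrices so that the sufficient condition (<= 1) holds."""
    s = level_sum(weights, d)
    return [W / s for W in weights] if s > 1.0 else list(weights)

rng = np.random.default_rng(8)
weights = [rng.normal(size=(20, 20)) for _ in range(3)]   # W_ab for arcs entering one level
print(level_sum(weights))                                 # typically much larger than 1
print(level_sum(rescale_to_satisfy(weights)))             # <= 1 after rescaling
```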
Figure 17 illustrates simulations pertaining to Theorem 2 for the network presented in Figure 9a, which comprises five layers of fusion modules. The simulations illustrate how the weight coefficients affect the Lipschitz constant. In Figure 17a, the upper bound for the Lipschitz constant $C(L)$, where $L = 1, \ldots, 5$ (presented as the maximum gain $\|N_L \mathbf{x} - N_L \mathbf{y}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ over all pairs of training data), increases with an increase in the number of fusion layers if the weight coefficients do not satisfy (32). If the weight coefficients are scaled to be compressible in accordance with (32), then the upper bound decreases with an increase in the number of fusion layers, as shown in Figure 17b. This is an indication that the network is stable relative to the input perturbation.
Figure 17. Stability simulation using the network shown in Figure 9a, in which the maximum gain $\|N_L \mathbf{x} - N_L \mathbf{y}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ for all training pairs is plotted against each fusion layer j, where $j = 1, \ldots, 5$ (the dimensions of the weight matrices and biases in $M_i$ in the top, bottom, and fusion channels of each module are $20 \times 20$ and $20 \times 1$, respectively; coefficients in $M_i$ are sampled i.i.d. from the standard normal distribution (mean zero and variance of one); inputs are 2000 random vectors (each of size $20 \times 1$) with entries sampled i.i.d. from the standard normal distribution): (a) the maximum gain increases with an increase in the number of fusion layers; (b) the maximum gain remains bounded when the weight coefficients in $M_i$ are scaled to meet (33) for Theorem 2. Note that $d = 1$ for ReLU activation functions.

5. Conclusions

Using an axiomatic approach, we established a systematic approach to representing deep feedforward neural networks (DNNs) as directed acyclic graphs (DAGs). The class of DNNs constructed using this approach is referred to as general DNNs, which includes DNNs with pragmatic modules, activation functions, and non-linear transformations. We demonstrated that general DNNs approximate a target function in a coarse-to-fine manner by learning a directed graph (generally not in a tree configuration) through the refinement of input space partitions. Furthermore, if the weight coefficients become increasingly sparse along any path of a graph, then the DNN function gains stability with respect to perturbations in the input space due to a bounded global Lipschitz constant. This study provides a systematic approach to studying a wide range of DNNs with architectures that are more complex than those of simple series-connected DNN models. In the future, we will explore the analysis and conclusions reported in this paper in terms of their applications. In this direction, we refer to [39] with respect to network pruning.

Author Contributions

Conceptualization, W.-L.H.; Software, S.-S.T.; Validation, S.-S.T.; Formal analysis, W.-L.H.; Data curation, S.-S.T.; Writing—original draft, W.-L.H.; Writing—review & editing, W.-L.H.; Visualization, S.-S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

An error in the original proof of Lemma 5 was corrected by Ming-Yu Chung. Figure 1 and Figure 2 were designed and drawn by Jinn Ho.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 5

Proof. 
(i) Without a loss of generality, the arcs in $N_L$ can be treated as atomic operations associated with functions in basis set $\mathcal{B}$. The base step is on input node I (i.e., $\bar l(0)$). Clearly, (30) holds with $C(0, p) = 1$.
Implementation of the induction step is based on levels. Suppose that (30) holds for all nodes at levels lower than n. If a is a node at level n, then any node $b \in I_a$ must be at a level $l(b) < n$. Thus, for $\mathbf{x} \in p$,
$$N_{(a),p}(\mathbf{x}) = [B_{ab} N_{(b),p}(\mathbf{x})]_{b \in I_a}, \qquad (A1)$$
where $B_{ab} \in \mathcal{B}$, in accordance with the fact that the functions associated with axiomatic operations on arcs are members of $\mathcal{B}$. If the atomic operation is a duplication, then $B_{ab} = I$, and if the atomic operation is a series connection/concatenation, then $B_{ab} \in \{ I, \rho L, M, \rho M, \sigma L, \sigma M \}$. Furthermore,
$$\|N_{(a),p}(\mathbf{x}) - N_{(a),p}(\mathbf{y})\|_2 = \big\| [B_{ab}(N_{(b),p}(\mathbf{x}) - N_{(b),p}(\mathbf{y}))]_{b \in I_a} \big\|_2. \qquad (A2)$$
Case 1. Consider $B_{ab} \in \{ I, \rho L, M, \rho M \}$. For $\mathbf{x}, \mathbf{y} \in p$, the bias term in $B_{ab}$ (if any) can be canceled. Applying the uniform bound assumption on activation functions (18) and applying (29) and (30) to levels lower than n results in the following:
$$\|N_{n,p}(\mathbf{x}) - N_{n,p}(\mathbf{y})\|_2 \le \sum_{a \in \bar l(n)} \sum_{b \in I_a} d_\rho \|W_{ab}\|_2\, \|N_{(b),p}(\mathbf{x}) - N_{(b),p}(\mathbf{y})\|_2 \le \sum_{a \in \bar l(n)} \sum_{b \in I_a} d_\rho \|W_{ab}\|_2\, C(l(b), p)\, \|\mathbf{x} - \mathbf{y}\|_2. \qquad (A3)$$
Case 2. Consider $B_{ab} \in \{ I, \sigma L, M, \sigma M \}$. Similarly, for $\mathbf{x}, \mathbf{y} \in p$, we can obtain the following:
$$\begin{aligned} \|N_{n,p}(\mathbf{x}) - N_{n,p}(\mathbf{y})\|_2 &\le \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|\sigma(M_{ab} N_{(b),p}(\mathbf{x})) - \sigma(M_{ab} N_{(b),p}(\mathbf{y}))\|_2 \\ &\le \sum_{a \in \bar l(n)} \sum_{b \in I_a} d_\sigma \|M_{ab} N_{(b),p}(\mathbf{x}) - M_{ab} N_{(b),p}(\mathbf{y})\|_2 \\ &\le \sum_{a \in \bar l(n)} \sum_{b \in I_a} d_\sigma \|W_{ab}\|_2\, \|N_{(b),p}(\mathbf{x}) - N_{(b),p}(\mathbf{y})\|_2 \\ &\le \sum_{a \in \bar l(n)} \sum_{b \in I_a} d_\sigma \|W_{ab}\|_2\, C(l(b), p)\, \|\mathbf{x} - \mathbf{y}\|_2, \end{aligned} \qquad (A4)$$
where $W_{ab}$ is the linear part of the affine function $M_{ab}$ and $d_\sigma$ is the Lipschitz constant bound defined in (19). Finally, (A3) and (A4) are combined using (21) to yield
$$C(n, p) = d \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|W_{ab}\|_2\, C(l(b), p), \qquad (A5)$$
where $l(b) \le n - 1$. This concludes the proof of (i).
(ii) Let $C^*(m-1, p)$ denote the maximal value of $C(k, p)$ for $k \le m - 1$. From (A5) and (31), we obtain the following:
$$C(m, p) \le d \sum_{a \in \bar l(m)} \sum_{b \in I_a} \|W_{ab}\|_2\, C^*(m-1, p) \le C^*(m-1, p). \qquad (A6)$$
Hence, $C^*(m, p) = C^*(m-1, p)$. Considering the fact that (31) holds for $n \ge m$, we obtain
$$C(n, p) \le d \sum_{a \in \bar l(n)} \sum_{b \in I_a} \|W_{ab}\|_2\, C^*(n-1, p) \le C^*(n-1, p). \qquad (A7)$$
Thus, $C^*(n, p) = C^*(n-1, p)$. Based on (A6) and (A7), we obtain
$$C^*(n, p) = C^*(m-1, p), \quad n \ge m. \qquad (A8)$$
The fact that $C(n, p) \le C^*(n, p)$ leads to the conclusion that $C(n, p)$ is a bounded sequence in n on $p$. □

Appendix B. Proof of Theorem 2

Proof. 
In accordance with (32) and Lemma 5, $C(n, p)$ is bounded for any $p$ and n. The fact that the activation functions of $N_L$ satisfy (A1) implies that the total number of partitions induced by an activation function is finite. Thus, the number of partition regions induced by $N_n$ is finite. Hence,
$$C(n) = \max_{p} C(n, p)$$
is defined and bounded above for any n. In accordance with Lemma 6 and the definition of $C(n)$, we obtain
$$\|N_n(\mathbf{x}) - N_n(\mathbf{y})\|_2 \le C(n)\, \|\mathbf{x} - \mathbf{y}\|_2.$$
We define $C^*(m) = \max_{k \le m} \max_{p} C(k, p)$ and, for any $L \ge n \ge m$, obtain
$$C(n) \le C^*(m).$$
The sequence $\{ C(n) \}_{m \le n \le L}$ is bounded for any L, such that $N_L$ is stable as $L \to \infty$. □

References

  1. Yang, Y.; Sun, J.; Li, H.; Xu, Z. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  2. Monga, V.; Li, Y.; Eldar, Y.C. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag. 2021, 38, 18–44. [Google Scholar] [CrossRef]
  3. Molnar, C. Interpretable Machine Learning. Lulu.com. 2020. Available online: https://christophmolnar.com/books/interpretable-machine-learning/ (accessed on 9 September 2023).
  4. Balestriero, R.; Cosentino, R.; Aazhang, B.; Baraniuk, R. The Geometry of Deep Networks: Power Diagram Subdivision. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 15806–15815. [Google Scholar]
  5. Hwang, W.L.; Heinecke, A. Un-rectifying non-linear networks for signal representation. IEEE Trans. Signal Process. 2019, 68, 196–210. [Google Scholar] [CrossRef]
  6. Li, Q.; Lin, T.; Shen, Z. Deep learning via dynamical systems: An approximation perspective. arXiv 2019, arXiv:1912.10382. [Google Scholar] [CrossRef]
  7. Baraniuk, R. The local geometry of deep learning (PowerPoint slides). In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (Plenary Speech), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  8. Sun, W.; Tsiourvas, A. Learning Prescriptive ReLU Networks. arXiv 2023, arXiv:2306.00651. [Google Scholar]
  9. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  10. Duvenaud, D.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Gómez-Bombarelli, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. arXiv 2015, arXiv:1509.09292. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Weinan, E. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 2017, 1, 1–11. [Google Scholar]
  13. Lu, Y.; Zhong, A.; Li, Q.; Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3276–3285. [Google Scholar]
  14. Haber, E.; Ruthotto, L. Stable architectures for deep neural networks. Inverse Probl. 2017, 34, 014004. [Google Scholar] [CrossRef]
  15. Chang, B.; Chen, M.; Haber, E.; Chi, E.H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. arXiv 2019, arXiv:1902.09689. [Google Scholar]
  16. Ascher, U.M.; Petzold, L.R. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations; SIAM: Philadelphia, PA, USA, 1998; Volume 61. [Google Scholar]
  17. Mallat, S. Group invariant scattering. Commun. Pure Appl. Math. 2012, 65, 1331–1398. [Google Scholar] [CrossRef]
  18. Bruna, J.; Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1872–1886. [Google Scholar] [CrossRef] [PubMed]
  19. Kuo, C.C.J.; Chen, Y. On data-driven saak transform. J. Vis. Commun. Image Represent. 2018, 50, 237–246. [Google Scholar]
  20. Gregor, K.; LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 399–406. [Google Scholar]
  21. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1828–1837. [Google Scholar]
  22. Chan, S.H. Performance analysis of plug-and-play ADMM: A graph signal processing perspective. IEEE Trans. Comput. Imaging 2019, 5, 274–286. [Google Scholar] [CrossRef]
  23. Heinecke, A.; Ho, J.; Hwang, W.L. Refinement and universal approximation via sparsely connected ReLU convolution nets. IEEE Signal Process. Lett. 2020, 27, 1175–1179. [Google Scholar] [CrossRef]
  24. Arora, R.; Basu, A.; Mianjy, P.; Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv 2016, arXiv:1611.01491. [Google Scholar]
  25. Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1319–1327. [Google Scholar]
  26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
  27. Gao, B.; Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv 2017, arXiv:1704.00805. [Google Scholar]
  28. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  30. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  31. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  32. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  36. Geman, S.; Bienenstock, E.; Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 1992, 4, 1–58. [Google Scholar] [CrossRef]
  37. Belkin, M.; Hsu, D.; Ma, S.; Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA 2019, 116, 15849–15854. [Google Scholar] [CrossRef] [PubMed]
  38. Cao, Y.; Chen, Z.; Belkin, M.; Gu, Q. Benign overfitting in two-layer convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 25237–25250. [Google Scholar]
  39. Hwang, W.L. Representation and decomposition of functions in DAG-DNNs and structural network pruning. arXiv 2023, arXiv:2306.09707. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
