The Representation Theory of Neural Networks

In this work, we show that neural networks can be represented via the mathematical theory of quiver representations. More specifically, we prove that a neural network is a quiver representation with activation functions, a mathematical object that we represent using a network quiver. We also show that network quivers gently adapt to common neural network concepts such as fully-connected layers, convolution operations, residual connections, batch normalization, and pooling operations. We show that this mathematical representation is by no means an approximation of what neural networks are, as it exactly matches reality. This interpretation is algebraic and can be studied with algebraic methods. We also provide a quiver representation model to understand how a neural network creates representations from the data. We show that a neural network saves the data as quiver representations and maps it to a geometrical space called the moduli space, which is given in terms of the underlying oriented graph of the network. This follows from the objects we define and from understanding how the neural network computes a prediction in a combinatorial and algebraic way. Overall, representing neural networks through quiver representation theory leads to 13 consequences that we believe are of great interest to better understand what neural networks are and how they work.


Introduction
Neural networks have achieved unprecedented performance in almost every area where machine learning is applicable (Raghu and Schmidt, 2020; Goodfellow et al., 2016). Throughout their history, neural networks have had several turning points with ground-breaking consequences that unleashed their power. To name a few, one might regard backpropagation via the chain rule (Rumelhart et al., 1986), the invention of convolutional layers (LeCun et al., 1989) and recurrent models (Rumelhart et al., 1986), the advent of low-cost specialized parallel hardware (mostly GPUs) (Krizhevsky et al., 2012) and the exponential growth of available training data as some of the most important factors behind today's success of neural networks.
Ironically, despite our understanding of every atomic element of a neural network and our capability to train it successfully, it is still difficult with today's formalism to understand what makes neural networks so effective. As neural nets increase in size, the combinatorics between their weights and activation functions makes it impossible (at least today) to formally answer questions such as: (i) why do neural networks [almost] always converge towards a global minimum regardless of their initialization, the data they are trained on and the associated loss function? (ii) what is the true capacity of a neural net? (iii) what are its true generalization capabilities?
One may hypothesize that the limited understanding of these fundamental concepts derives from the more or less formal representation that we have of these machines. Since the '80s, neural nets have been mostly represented in two ways: (i) a cascade of nonlinear atomic operations (be it, a series of neurons, layers, convolution blocks, etc.) often represented graphically (e.g., Fig.3 by He et al. (2016)) and (ii) a point in an N dimensional Euclidean space (where N is the number of weights in the network) lying on the slope of a loss landscape that an optimizer ought to climb down (Li et al., 2018).
In this work, we propose a fundamentally different way to represent neural networks. Based on quiver representation theory, we provide a new mathematical footing to represent a neural network as well as the data it processes. We show that this mathematical representation is by no means an approximation of what neural networks are as it tightly matches reality.
In this paper, we do not focus on how neural networks learn, but rather on the intrinsic properties of their architectures, thereby providing new insights on how to understand neural networks. Our mathematical formulation accounts for a wide variety of architectures, usages and behaviours of today's neural networks. For this, we study the combinatorial and algebraic nature of neural networks by using ideas coming from the mathematical theory of quiver representations (Assem et al., 2006; Schiffler, 2014). This paper is based on two observations that expose the algebraic nature of neural networks and how it is related to quiver representations:
1. Neural networks are used as quiver representations together with activation functions.
2. The forward pass of data through the network is encoded as quiver representations.
Everything else in this work is a mathematical consequence of these two observations. Our main contributions can be summarized by the following six items:
1. We provide the first explicit link between representations of quivers and neural networks.
2. We show that quiver representations gently adapt to common neural network concepts such as fully-connected layers, convolution operations, residual connections, batch normalization, and pooling operations.
3. We prove that algebraic isomorphisms of neural networks preserve the network function and obtain, as a corollary, that ReLU networks are positive scale invariant (Dinh et al., 2017; Neyshabur et al., 2015).
4. We present the theoretical interpretation of data in terms of the architecture of the neural network and of quiver representations.
5. We mathematically formalize a modified version of the manifold hypothesis of Goodfellow et al. (2016) in terms of the architecture of the network.
6. We provide constructions and results supporting existing intuitions in deep learning while discarding others, and bring new concepts to the table.

Previous work
In the theoretical description of the deep neural optimization paradigm given by Choromanska et al. (2015), the authors underline that "clearly the model (neural net) contains several dependencies as one input is associated with many paths in the network. That poses a major theoretical problem in analyzing these models as it is unclear how to account for these dependencies." Interestingly, this is exactly what quiver representations are about (Assem et al., 2006;Barot, 2015;Schiffler, 2014). While as far as we know, quiver representation theory has never been used in neural networks, some authors have nonetheless used a sub-set of it, sometimes unbeknown to them. It is the case of the so-called positive scale invariance of ReLU networks which Dinh et al. (2017) used to mathematically prove that most notions of loss flatness cannot be used directly to explain generalization. This property of ReLU networks has also been used by Neyshabur et al. (2015) to improve the optimization of ReLU networks. In their paper, they propose the Path-SGD (stochastic gradient descent), which is an approximate gradient descent method with respect to a path-wise regularizer. Also,  defined a space where points are ReLU networks with the same network function, which they use to find better gradient descent paths. In this paper (cf. Theorem 4.9 and Corollary 4.10), we prove that positive scale invariance of ReLU networks is a property derived from the representation theory of neural networks that we present in the following section. Wood and Shawe-Taylor (1996) used group representation theory to account for symmetries in the layers of a neural network. Our mathematical approach is different since quiver representations are representations of algebras (Assem et al., 2006) and not of groups. Besides, Wood and Shawe-Taylor (1996) present architectures that match mathematical objects with nice properties while we define the objects that model the computations of the neural network. 
We prove that quiver representations are more suited to study networks due to their combinatorial and algebraic nature. Healy and Caudell (2004) mathematically represent neural networks by objects called categories. However, as mentioned by the authors, their representation is an approximation of what neural nets are as they do not account for each of their atomic elements. In contrast, our quiver representation approach includes every computation involved in a neural network, be it a neural operation (i.e., dot product + activation function), layer operations (fully connected, convolutional, pooling) as well as batch normalization. As such, our representation matches what neural networks are and by no means is an approximation of it.
Quiver representations have been used to find lower-dimensional sub-space structures of datasets (Chindris and Kline, 2020) without, however, any relation to neural networks.
Our interpretation of data is orthogonal to this one since we look at how neural networks interpret the data in terms of its computations.
As mentioned by Arora in his 2018 ICML tutorial (Arora, 2018), our goal is to provide a theoretical footing that can validate certain intuitions about deep neural nets and lead to new insights and new concepts. One such intuition is related to feature map visualization. It is well known that feature maps can be visualized as images showing the input signal characteristics, thus providing intuitions on the behavior of the network and its impact on an image (Yosinski et al., 2015; Feghahati et al., 2019). This notion is strongly supported by our findings. Namely, our data representation introduced in Section 6 is a thin quiver representation that contains the network features (i.e., neuron outputs or feature maps) induced by the data. Said otherwise, our data representation includes both the network structure and the data (see Eq. (5) on page 21 and the proof of Theorem 6.4).
We show in Section 7 that our data representations lie in a so-called moduli space. Interestingly, the dimension of the moduli space (plus one) is the same value as that computed by  and used to measure the capacity of ReLU networks. They empirically confirmed that the dimension of the moduli space is directly linked to generalization.
The moduli space also formalizes a modified version of the manifold hypothesis for the data (see Goodfellow et al., 2016, chap. 5.11.3). This hypothesis states that high-dimensional data (typically images and text) live on a thin and yet convoluted manifold in their original space. While this assumption is usually considered true, no theorem has ever proved it. Concerning our data quiver representations, which are derived from the neural net architecture and not from the data itself, we mathematically prove that they lie in a manifold (generated by the image of an explicitly provided map) inside the moduli space.
Naive pruning of neural networks (Frankle and Carbin, 2019), where the smallest weights get pruned, is also explained by our interpretation of the data and the moduli space (see consequence 4 in Section 7.1), since the coordinates of the data quiver representations inside the moduli space are given as a function of the weights of the network and the activation outputs of each neuron on a forward pass (cf. Eq. (5) on page 21).
There exist empirical results where, up to certain restrictions, the activation functions can be learned (Goyal et al., 2019) and our interpretation of the data supports why this is a good idea in terms of the moduli space. For further details see Section 7.1.

Preliminaries of Quiver Representations
Before we show how neural networks are related to quiver representations, we start by defining the basic concepts related to the quiver representation theory (Assem et al., 2006;Barot, 2015;Schiffler, 2014).
Definition 3.1. (Assem et al., 2006, chap. 2) A quiver $Q$ is given by a tuple $(V, E, s, t)$ where $(V, E)$ is an oriented graph with a set of vertices $V$ and a set of oriented edges $E$, and maps $s, t : E \to V$ that send $\epsilon \in E$ to its source vertex $s(\epsilon) \in V$ and target vertex $t(\epsilon) \in V$, respectively.
Figure 1: (a) A quiver Q. (b) A quiver representation W over Q, where vertices a to d are complex 3D, 2D, 1D and 5D vector spaces, and W α is a 2 × 3 matrix, W β is a 1 × 2 matrix, W γ is a 5 × 2 matrix and W δ is a 2 × 2 matrix. (c) Another quiver representation U over Q, where a to d are complex 4D, 1D, 3D and 2D vector spaces, and U α is a 1 × 4 matrix, U β is a 3 × 1 matrix, U γ is a 2 × 1 matrix and U δ is a 1 × 1 matrix.
Definition 3.2. (Assem et al., 2006, chap. 3) If $Q$ is a quiver, a quiver representation of $Q$ is given by a pair of sets
$$W := \big( (W_v)_{v \in V}, \; (W_\epsilon)_{\epsilon \in E} \big)$$
where the $W_v$'s are vector spaces indexed by the vertices of $Q$, and the $W_\epsilon$'s are linear maps indexed by the oriented edges of $Q$, such that for every edge $\epsilon \in E$
$$W_\epsilon : W_{s(\epsilon)} \to W_{t(\epsilon)}.$$
Fig. 1(a) illustrates a quiver Q while Fig. 1(b),(c) are two quiver representations of Q.
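As a minimal sketch of Definition 3.2, the following Python snippet checks that the matrix sizes quoted for the representations W and U of Fig. 1 are compatible with a quiver. The edge endpoints in `EDGES` are not stated in the text; they are inferred here from the matrix sizes and should be read as an assumption, as should the helper name `is_representation`.

```python
# Hypothetical encoding of the quiver of Fig. 1: edge name -> (source, target).
# The endpoints are inferred from the matrix sizes given for Fig. 1(b) (assumption).
EDGES = {
    "alpha": ("a", "b"),
    "beta": ("b", "c"),
    "gamma": ("b", "d"),
    "delta": ("b", "b"),  # a loop on vertex b
}


def is_representation(dims, shapes, edges=EDGES):
    """Check that every edge map has shape (dim of target) x (dim of source)."""
    return all(shapes[e] == (dims[t], dims[s]) for e, (s, t) in edges.items())


# Dimensions of the vector spaces and shapes of the matrices quoted for Fig. 1.
W_dims = {"a": 3, "b": 2, "c": 1, "d": 5}
W_shapes = {"alpha": (2, 3), "beta": (1, 2), "gamma": (5, 2), "delta": (2, 2)}
U_dims = {"a": 4, "b": 1, "c": 3, "d": 2}
U_shapes = {"alpha": (1, 4), "beta": (3, 1), "gamma": (2, 1), "delta": (1, 1)}

W_ok = is_representation(W_dims, W_shapes)
U_ok = is_representation(U_dims, U_shapes)
mixed_ok = is_representation(U_dims, W_shapes)  # matrices of W on the spaces of U
```

Only shapes are checked here; the entries of the matrices play no role in deciding whether the data forms a quiver representation.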
Definition 3.3. (Assem et al., 2006, chap. 3) Let $Q$ be a quiver and let $W$ and $U$ be two representations of $Q$. A morphism of representations $\tau : W \to U$ is a set of linear maps $\tau = (\tau_v)_{v \in V}$ indexed by the vertices of $Q$, where $\tau_v : W_v \to U_v$, such that $\tau_{t(\epsilon)} W_\epsilon = U_\epsilon \tau_{s(\epsilon)}$ for every $\epsilon \in E$.
To illustrate this definition, one may consider the quiver $Q$ and its representations $W$ and $U$ of Fig. 1. The morphism between $W$ and $U$ via the linear maps $\tau$ is pictured in Fig. 2(a). As shown, each $\tau_v$ is a matrix that transforms the vector space of vertex $v$ of $W$ into the vector space of vertex $v$ of $U$.
Definition 3.4. Let Q be a quiver and let W and U be two representations of Q. If there is a morphism of representations τ : W → U where each τ v is an invertible linear map, then W and U are said to be isomorphic representations.
In Section 4, we will be working with a particular type of quiver representation, where the vector space of each vertex is 1D. These 1D representations are called thin representations, and the morphisms of representations between thin representations are easily described. More specifically, a thin representation of a quiver Q is a quiver representation W of Q such that W v = C for every vertex v ∈ V.
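Figure 2: (a) The morphism of representations τ : W → U between the representations W and U of Fig. 1. (b) The diagrams that the transformations τ v must make commutative for τ : W → U to be a morphism of representations.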
If $W$ is a thin representation of $Q$, then every linear map $W_\epsilon$ is a 1 × 1 matrix, so $W_\epsilon$ is given by multiplication with a fixed complex number. We may and will identify every linear map between one-dimensional spaces with the number whose multiplication defines it.
Before we move on to neural networks, we will introduce the notion of group and action of a group.
Definition 3.5. (Rotman, 1995, chap. 1) A non-empty set G is called a group if there exists a function · : G × G → G, called the product of the group and denoted a · b, such that
• (a · b) · c = a · (b · c) for all a, b, c ∈ G.
• There exists an element e ∈ G, called the identity of G, such that e · a = a · e = a for all a ∈ G.
• For each a ∈ G there exists $a^{-1}$ ∈ G such that $a \cdot a^{-1} = a^{-1} \cdot a = e$.
For example, the set of non-zero complex numbers C * (and also the non-zero real numbers R * ) with the usual multiplication operator forms a group. Usually, one does not write the product of the group as a dot and just concatenates the elements to denote multiplication ab = a · b, as for the product of numbers.
Definition 3.6. (Rotman, 1995, chap. 3) Let G be a group and let X be a set. We say that there is an action of G on X if there exists a map · : G × X → X such that
• e · x = x for all x ∈ X, where e ∈ G is the identity.
• a · (b · x) = (ab) · x, for all a, b ∈ G and all x ∈ X.
In our case, G will be a group indexed by the vertices of Q, and the set X will be the set of thin quiver representations of Q.
Let $W$ be a thin representation of a quiver $Q$. Given a choice of invertible (non-zero) linear maps $\tau_v : \mathbb{C} \to \mathbb{C}$ for every $v \in V$, we are going to construct a thin representation $U$ such that $\tau = (\tau_v)_{v \in V} : W \to U$ is an isomorphism of representations. Since $U$ is thin, we have that $U_v = \mathbb{C}$ for all $v \in V$. Let $\epsilon : a \to b$ be an edge of $E$; we define the group action as follows:
$$U_\epsilon = W_\epsilon \frac{\tau_b}{\tau_a}. \qquad (1)$$
Thus, for every edge $\epsilon \in E$ we get a commutative diagram, that is, $\tau_b W_\epsilon = U_\epsilon \tau_a$.
The construction of the thin representation $U$ from the thin representation $W$ and the choice of invertible linear maps $\tau$ defines an action of a group on thin representations. The set of all possible isomorphisms $\tau = (\tau_v)_{v \in V}$ of thin representations of $Q$ forms such a group, called the change of basis group $G$, defined as the product group
$$G = \prod_{v \in V} \mathbb{C}^*,$$
where $\mathbb{C}^*$ denotes the multiplicative group of non-zero complex numbers. That is, the elements of $G$ are vectors of non-zero complex numbers $\tau = (\tau_1, ..., \tau_n)$ indexed by the set $V$ of vertices of $Q$, and the group operation between two elements $\tau = (\tau_1, ..., \tau_n)$ and $\sigma = (\sigma_1, ..., \sigma_n)$ is by definition $\tau\sigma := (\tau_1 \sigma_1, ..., \tau_n \sigma_n)$. We use the action notation for the action of the group $G$ on thin representations. Namely, for $\tau \in G$ of the form $\tau = (\tau_v)_{v \in V}$ and a thin representation $W$ of $Q$, the thin representation $U$ constructed above is denoted $\tau \cdot W$.
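The construction above can be checked numerically. The sketch below, written under the assumption that a thin representation is stored as a dictionary of complex edge weights, rescales each weight by the ratio of the change of basis at its target and source, and then verifies both the commutativity condition and the two axioms of a group action. The quiver, the numbers and the helper names `act` and `is_morphism` are illustrative choices, not taken from the paper.

```python
def act(tau, W, edges):
    """Return tau . W, where (tau . W)_e = W_e * tau[target] / tau[source]."""
    return {e: W[e] * tau[t] / tau[s] for e, (s, t) in edges.items()}


def is_morphism(tau, W, U, edges, tol=1e-12):
    """Check tau_target * W_e == U_e * tau_source on every edge."""
    return all(abs(tau[t] * W[e] - U[e] * tau[s]) < tol for e, (s, t) in edges.items())


# A small illustrative quiver a -> b -> c with a thin representation W.
edges = {"e1": ("a", "b"), "e2": ("b", "c")}
W = {"e1": 2 + 1j, "e2": -0.5j}

# Two elements of the change of basis group (non-zero complex numbers per vertex).
tau = {"a": 1j, "b": 2.0, "c": -1 + 1j}
sigma = {"a": 3.0, "b": 1 - 1j, "c": 0.5}
identity = {v: 1.0 for v in tau}
product = {v: sigma[v] * tau[v] for v in tau}

U = act(tau, W, edges)                       # U = tau . W
same = act(identity, W, edges)               # e . W
twice = act(sigma, act(tau, W, edges), edges)  # sigma . (tau . W)
once = act(product, W, edges)                # (sigma tau) . W
```

Up to floating-point error, `same` agrees with `W` and `twice` agrees with `once`, which are exactly the two conditions of Definition 3.6.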

Neural Networks
In this section, we connect the dots between neural networks and the basic definitions of quiver representation theory that we presented before. But before we do so, let us mention that since the vector space of each vertex of a quiver representation is defined over the complex numbers, the weights of the neural networks that we are about to present will also be complex numbers. Despite some papers on complex neural networks (Nitta, 1997), this approach may seem unorthodox. However, the use of complex numbers is a mathematical prerequisite for the upcoming notion of moduli space that we will introduce in Section 7.
For the rest of this paper, we will focus on a special type of quiver Q that we call a network quiver. The network quiver Q has no oriented cycles other than loops. Also, a sub-set of d source vertices of Q are called the input vertices. The source vertices that are not input vertices are called bias vertices. Let k be the number of all sinks of Q; we call these the output vertices. All other vertices of Q are called hidden vertices.
Definition 4.1. A quiver Q is arranged by layers if it can be drawn from left to right arranging its vertices in columns such that:
• There are no oriented edges from vertices on the right to vertices on the left.
• There are no oriented edges between vertices in the same column, other than loops and edges from bias vertices.
The first layer on the left, called the input layer, will be formed by the d input vertices. The last layer on the right, called the output layer, will be formed by the k output vertices. The layers that are neither input nor output layers are called hidden layers. We enumerate the hidden layers from left to right as: 1st hidden layer, 2nd hidden layer, 3rd hidden layer, and so on.
From now on Q will always denote a quiver with d input vertices and k output vertices.
Definition 4.2. A network quiver Q is a quiver arranged by layers such that:
1. There are no loops on source (i.e., input and bias) nor sink vertices.
2. There is exactly one loop on each hidden vertex.
3. Other than these loops, there are no more oriented cycles.
An example of a network quiver can be found in Fig. 3(a).
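As an illustration of Definition 4.2, the following sketch tests the three loop and cycle conditions on a quiver given as a list of (source, target) pairs. It is only a sketch under stated assumptions: the function name `is_network_quiver` and the small example quiver are hypothetical, a hidden vertex is taken to be one that is neither a source nor a sink once loops are ignored, and the arrangement by layers of Definition 4.1 is not checked.

```python
def is_network_quiver(vertices, edges):
    """Check conditions 1-3 of Definition 4.2 (layer arrangement is not checked)."""
    loops = [s for s, t in edges if s == t]
    plain = [(s, t) for s, t in edges if s != t]  # edges of the delooped quiver
    has_in = {t for _, t in plain}
    has_out = {s for s, _ in plain}
    for v in vertices:
        hidden = v in has_in and v in has_out
        # Exactly one loop on hidden vertices, none on sources and sinks.
        if loops.count(v) != (1 if hidden else 0):
            return False
    # Kahn's algorithm: the delooped quiver must have no oriented cycles.
    indeg = {v: 0 for v in vertices}
    for _, t in plain:
        indeg[t] += 1
    queue = [v for v in vertices if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for s, t in plain:
            if s == v:
                indeg[t] -= 1
                if indeg[t] == 0:
                    queue.append(t)
    return seen == len(vertices)


# Hypothetical example: two inputs, one bias vertex, two hidden vertices, one output.
vertices = ["x1", "x2", "b", "h1", "h2", "y"]
edges = [("x1", "h1"), ("x2", "h1"), ("b", "h1"),
         ("x1", "h2"), ("x2", "h2"), ("b", "h2"),
         ("h1", "y"), ("h2", "y"),
         ("h1", "h1"), ("h2", "h2")]  # one loop per hidden vertex

valid = is_network_quiver(vertices, edges)
missing_loop = is_network_quiver(vertices, edges[:-1])          # h2 loses its loop
loop_on_sink = is_network_quiver(vertices, edges + [("y", "y")])
with_cycle = is_network_quiver(vertices, edges + [("h1", "h2"), ("h2", "h1")])
```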
Definition 4.3. The delooped quiver $Q^\bullet$ of $Q$ is the quiver obtained by removing all loops of $Q$. We denote $Q^\bullet = (V, E^\bullet, s^\bullet, t^\bullet)$.
When a neural network computes a forward pass (be it a multilayer perceptron or a convolutional neural network), the weight between two neurons is used to multiply the output signal of the first neuron and the result is fed to the other neuron. Since multiplication by a fixed number is a linear map, a weight is used as a linear map between two 1D vector spaces. Therefore the weights of a neural network define a thin quiver representation of the delooped quiver $Q^\bullet$ of its network quiver $Q$, every time it computes a prediction.
When a neural network computes a forward pass, we get a combination of two things:
1. A thin quiver representation.
2. Activation functions.
We will encode the point-wise usage of activation functions as maps assigned to the loops of a network quiver. A neural network over a network quiver $Q$ is thus a pair $(W, f)$, where $W$ is a thin representation of the delooped quiver $Q^\bullet$ and $f = (f_v)$ collects the activation functions assigned to the loops of $Q$. An example of a neural network $(W, f)$ over a network quiver $Q$ can be seen in Fig. 3(b). The words neuron and unit refer to the combinatorics of a vertex together with its activation function in a neural network over a network quiver. The weights of a neural network $(W, f)$ are the complex numbers defining the maps $W_\epsilon$ for all $\epsilon \in E$.
Please note that since in practice neural networks use real values, our usage of complex weights should be seen as a generalization of the usual definition of neural networks. When computing a prediction, we have to take into account two things:
• The activation function is applied to the sum of all input values of the neuron.
• The activation output of each vertex is multiplied by each weight going out of that neuron.
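Definition 4.5. Let $(W, f)$ be a neural network over a network quiver $Q$ and let $x \in \mathbb{C}^d$ be an input vector of the network. Denote by $\zeta_v$ the set of edges of $Q$ with target $v$. The activation output of the vertex $v \in V$ with respect to $x$ after applying a forward pass is denoted $a(W, f)_v(x)$ and is computed as follows:
$$a(W, f)_v(x) = \begin{cases} x_v & \text{if } v \text{ is an input vertex,} \\ 1 & \text{if } v \text{ is a bias vertex,} \\ f_v\Big( \sum_{\epsilon \in \zeta_v} W_\epsilon \, a(W, f)_{s(\epsilon)}(x) \Big) & \text{if } v \text{ is a hidden vertex,} \\ \sum_{\epsilon \in \zeta_v} W_\epsilon \, a(W, f)_{s(\epsilon)}(x) & \text{if } v \text{ is an output vertex.} \end{cases}$$
For our purposes, it is convenient to consider no activation functions on the output vertices. This is consistent with current deep learning practices as one can consider the activation functions of the output neurons to be part of the loss function (like softmax + cross-entropy or as done by Dinh et al. (2017)).
Definition 4.6. Let $(W, f)$ be a neural network over a network quiver $Q$. The network function of the neural network is the function
$$\Psi(W, f) : \mathbb{C}^d \to \mathbb{C}^k$$
where the coordinates of $\Psi(W, f)(x)$ are the activation outputs of the output vertices of $(W, f)$ (often called the "score" of the neural net) with respect to an input vector $x \in \mathbb{C}^d$.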
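To make the forward pass concrete, here is a minimal Python sketch that computes the activation outputs vertex by vertex on the delooped quiver and reads the network function off the output vertex. Several choices are assumptions of this sketch rather than statements of the paper: the function name `forward`, the small example network, real-valued weights with ReLU, and the convention that bias vertices output 1.

```python
def forward(order, edges, W, f, x, biases=()):
    """Compute the activation output of every vertex.

    order  : vertices listed so that every edge goes from an earlier to a later vertex
    edges  : dict edge name -> (source, target), loops excluded
    W      : dict edge name -> weight
    f      : dict hidden vertex -> activation function (one per loop)
    x      : dict input vertex -> input value
    biases : bias vertices, whose output is taken to be 1 (assumption)
    """
    a = {}
    for v in order:
        if v in x:
            a[v] = x[v]
        elif v in biases:
            a[v] = 1.0
        else:
            total = sum(W[e] * a[s] for e, (s, t) in edges.items() if t == v)
            # Hidden vertices apply their activation; output vertices do not.
            a[v] = f[v](total) if v in f else total
    return a


def relu(z):
    return max(0.0, z)


# Hypothetical network: inputs x1, x2, bias b, hidden h1, h2, output y.
edges = {"x1h1": ("x1", "h1"), "x2h1": ("x2", "h1"), "bh1": ("b", "h1"),
         "x1h2": ("x1", "h2"), "x2h2": ("x2", "h2"), "bh2": ("b", "h2"),
         "h1y": ("h1", "y"), "h2y": ("h2", "y")}
W = {"x1h1": 1.0, "x2h1": -1.0, "bh1": 0.5,
     "x1h2": 2.0, "x2h2": 1.0, "bh2": -3.0,
     "h1y": 1.0, "h2y": 2.0}
f = {"h1": relu, "h2": relu}
order = ["x1", "x2", "b", "h1", "h2", "y"]

a = forward(order, edges, W, f, {"x1": 1.0, "x2": 2.0}, biases=("b",))
score = [a["y"]]  # the value of the network function at x = (1, 2)
```

For this input, the pre-activation of h1 is 1 - 2 + 0.5 = -0.5, so its activation output is 0, while h2 receives 2 + 2 - 3 = 1; the score is therefore 1 · 0 + 2 · 1 = 2.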
We now define maps that behave well with respect to the structure of neural networks, namely the thin quiver representation and the activation functions at the same time.
Definition 4.7. Let $(W, f)$ and $(V, g)$ be neural networks over the same network quiver $Q$. A morphism of neural networks $\tau : (W, f) \to (V, g)$ is a morphism of thin quiver representations $\tau : W \to V$ such that $\tau_v = 1$ for all $v \in V$ that is not a hidden vertex, and for every hidden vertex $v \in V$ the following diagram is commutative, that is, $\tau_v f_v(z) = g_v(\tau_v z)$ for all $z \in \mathbb{C}$. A morphism of neural networks $\tau : (W, f) \to (V, g)$ is an isomorphism of neural networks if $\tau : W \to V$ is an isomorphism of quiver representations. We say that two neural networks over $Q$ are isomorphic if there exists an isomorphism of neural networks between them.
Every morphism of neural networks is an isomorphism of neural networks due to the condition that the change of basis is equal to the identity on the source and output vertices.
Remark 4.8. The terms 'network morphism' (Wei et al., 2016), 'isomorphic neural network' and 'isomorphic network structures' (Stagge and Igel, 2000) have already been used with different approaches. In this work, we will not refer to any of those terms.
Denote by $\widetilde{Q} = (\widetilde{V}, \widetilde{E}, \widetilde{s}, \widetilde{t})$ the hidden quiver of $Q$, which is given by the hidden vertices $\widetilde{V}$ of $Q$ and all the oriented edges $\widetilde{E}$ between hidden vertices of $Q$ that are not loops. Said otherwise, $\widetilde{Q}$ is the same as the delooped quiver $Q^\bullet$ but without the source and sink vertices. The group of change of basis for neural networks is denoted as
$$\widetilde{G} = \prod_{v \in \widetilde{V}} \mathbb{C}^*.$$
Note that this group has as many factors as hidden vertices of $Q$. An element of the change of basis group $\widetilde{G}$ is called a change of basis of the neural network $(W, f)$. Given an element $\tau \in \widetilde{G}$ we can induce $\tau \in G$, where $G$ is the change of basis group of thin representations over the delooped quiver $Q^\bullet$. We do this by assigning $\tau_v = 1$ for every $v \in V$ that is not a hidden vertex. Therefore, we will simply write $\tau$ for elements of $\widetilde{G}$ considered as elements of $G$.
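The action of the group $G$ on a neural network $(W, f)$ is defined on a given element $\tau \in G$ and a neural network $(W, f)$ by
$$\tau \cdot (W, f) = (\tau \cdot W, \tau \cdot f),$$
where $\tau \cdot W$ is the thin representation such that for each edge $\epsilon \in E$, the linear map $(\tau \cdot W)_\epsilon = W_\epsilon \frac{\tau_{t(\epsilon)}}{\tau_{s(\epsilon)}}$ following the group action of Eq. (1), and the activation $\tau \cdot f$ on the hidden vertex $v \in V$ is given by
$$(\tau \cdot f)_v(x) = \tau_v f_v\!\left( \frac{x}{\tau_v} \right). \qquad (2)$$
Observe that $(\tau \cdot W, \tau \cdot f)$ is a neural network such that $\tau : (W, f) \to (\tau \cdot W, \tau \cdot f)$ is an isomorphism of neural networks. This leads us to the following theorem, which is an important cornerstone of our paper. Please refer to Appendix A for an illustration of its proof.
Theorem 4.9. If $\tau : (W, f) \to (V, g)$ is an isomorphism of neural networks, then $\Psi(W, f) = \Psi(V, g)$.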
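Proof. We proceed with a forward pass to compute the activation outputs of both neural networks with respect to the same input vector. Let $x \in \mathbb{C}^d$ be the input vector of the networks; for every source vertex $v \in V$ we have
$$a(W, f)_v(x) = a(V, g)_v(x),$$
since the activation output of a source vertex depends neither on the weights nor on the activation functions. Now let $v \in V$ be a vertex in the first hidden layer and $\zeta_v$ the set of edges between the source vertices and $v$; the activation output of $v$ in $(W, f)$ is
$$a(W, f)_v(x) = f_v\Big( \sum_{\epsilon \in \zeta_v} W_\epsilon \, a(W, f)_{s(\epsilon)}(x) \Big).$$
As an illustration, if $(W, f)$ is the neural network of Fig. 3, the source vertices would be a, b, f, the first hidden layer vertices would be c, d, e and the weights $W_\epsilon$ in the previous equation would be those of the edges from the source vertices to c, d, e. Since $\tau$ is an isomorphism with $\tau_{s(\epsilon)} = 1$ on source vertices, we have $V_\epsilon = W_\epsilon \tau_v$ for every $\epsilon \in \zeta_v$, and since $g_v(\tau_v z) = \tau_v f_v(z)$, the activation output of $v$ in $(V, g)$ is
$$a(V, g)_v(x) = g_v\Big( \sum_{\epsilon \in \zeta_v} W_\epsilon \tau_v \, a(W, f)_{s(\epsilon)}(x) \Big) = \tau_v f_v\Big( \sum_{\epsilon \in \zeta_v} W_\epsilon \, a(W, f)_{s(\epsilon)}(x) \Big) = \tau_v \, a(W, f)_v(x).$$
Assume now that $v \in V$ is in the second hidden layer (e.g., vertex g or h in Fig. 3). Then $V_\epsilon = W_\epsilon \frac{\tau_v}{\tau_{s(\epsilon)}}$ and $a(V, g)_{s(\epsilon)}(x) = \tau_{s(\epsilon)} \, a(W, f)_{s(\epsilon)}(x)$ for every $\epsilon \in \zeta_v$, so that
$$a(V, g)_v(x) = g_v\Big( \sum_{\epsilon \in \zeta_v} W_\epsilon \frac{\tau_v}{\tau_{s(\epsilon)}} \, \tau_{s(\epsilon)} \, a(W, f)_{s(\epsilon)}(x) \Big) = \tau_v \, a(W, f)_v(x).$$
By induction, the same identity holds for every hidden vertex. Finally, the coordinates of $\Psi(W, f)(x)$ are the activation outputs of $(W, f)$ on the output vertices, and analogously for $\Psi(V, g)(x)$. Since $\tau_v = 1$ for every output vertex $v \in V$, we get that
$$a(V, g)_v(x) = \sum_{\epsilon \in \zeta_v} W_\epsilon \frac{1}{\tau_{s(\epsilon)}} \, \tau_{s(\epsilon)} \, a(W, f)_{s(\epsilon)}(x) = a(W, f)_v(x),$$
which proves that an isomorphism between two neural networks $(W, f)$ and $(V, g)$ preserves the network function.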
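The statement that isomorphic neural networks share the same network function can be sanity-checked numerically. The sketch below does so for a network with a single hidden layer, complex weights and a complex hyperbolic tangent as activation; these choices, the helper names `network_function` and `change_basis`, and all numerical values are assumptions made for illustration only. Weights entering a hidden neuron are multiplied by its change of basis, weights leaving it are divided by it, and the activation becomes z ↦ τ f(z/τ).

```python
import cmath


def network_function(W1, W2, f, x):
    """One hidden layer: W1[j][i] is the weight from input i to hidden neuron j,
    W2[k][j] the weight from hidden neuron j to output k; outputs have no activation."""
    hidden = [f[j](sum(w * xi for w, xi in zip(row, x))) for j, row in enumerate(W1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def change_basis(W1, W2, f, tau):
    """Apply one non-zero complex number tau[j] per hidden neuron (tau = 1 elsewhere)."""
    V1 = [[w * tau[j] for w in row] for j, row in enumerate(W1)]    # edges into j
    V2 = [[w / tau[j] for j, w in enumerate(row)] for row in W2]    # edges out of j
    g = [(lambda z, t=t, fj=fj: t * fj(z / t)) for t, fj in zip(tau, f)]
    return V1, V2, g


# Illustrative complex weights, activations, change of basis and input.
W1 = [[1 + 2j, -1j], [0.5, 2 - 1j]]
W2 = [[1j, -2.0]]
f = [cmath.tanh, cmath.tanh]
tau = [2 - 1j, -0.5j]
x = [0.3 + 0.1j, -1.2j]

V1, V2, g = change_basis(W1, W2, f, tau)
out_W = network_function(W1, W2, f, x)
out_V = network_function(V1, V2, g, x)
```

The two networks have different weights and different activation functions, yet `out_W` and `out_V` agree up to floating-point error.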

Consequences
Representing a neural network over a network quiver Q by a pair (W, f ) and Theorem (4.9) have two consequences on neural networks.

Consequence 1 If each neuron of a neural network is assigned a non-zero change of basis value τ v ∈ C, its weights W can be transformed to another set of weights V following the group action of Eq. (1). Similarly, the activation functions f of that network can be transformed to other ones g following the group action of Eq. (2). For example, if f is ReLU and τ v is a negative real value, then g becomes an inverted-flipped ReLU function, i.e., min(0, x). From the usual neural network representation standpoint, the two neural networks (W, f ) and (V, g) are different as their activation functions f and g are different and their weights W and V are different. Nonetheless, their function (i.e., the output of the networks given some input vector x) is rigorously identical. And this is true regardless of the structure of the neural network, its activation functions and weight vector W . Said otherwise, Theorem 4.9 implies that the neural network realizing a given function is not unique: an [infinite] number of other neural networks with different weights and different activation functions have the same function, and these other neural networks may be recovered with the change of basis group G.
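The ReLU example can be verified directly. In the sketch below, the helper `transformed` (a hypothetical name) builds the activation z ↦ τ f(z/τ) obtained from f by a change of basis τ; with f equal to ReLU and τ = −1 it coincides with min(0, x) on the sample points.

```python
def relu(x):
    return max(0.0, x)


def transformed(f, tau):
    """Activation obtained from f by the change of basis tau: x -> tau * f(x / tau)."""
    return lambda x: tau * f(x / tau)


g = transformed(relu, -1.0)

samples = [-2.0, -0.5, 0.0, 1.0, 3.0]
g_values = [g(x) for x in samples]
expected = [min(0.0, x) for x in samples]
```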
Consequence 2 A weak version of Theorem (4.9) proves a property of ReLU networks known as positive scale invariance or positive homogeneity (Badrinarayanan et al., 2015;Dinh et al., 2017;Yi et al., 2019;Yuan and Xiao, 2019). Positive scale invariance is a property of ReLU non-linearities, where the network function remains unchanged if we (for example) multiply the weights in one layer of a network by a positive factor, and divide the weights on the next layer by that same positive factor. Even more, this can be done on a per neuron basis. Namely, assigning a positive factor r > 0 to a neuron and multiplying every weight that points to that neuron with r, and dividing every weight that starts on that neuron by r.
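Indeed, when $f$ is ReLU and every $\tau_v$ is a positive real number, the activation functions are left unchanged by the change of basis, $\tau \cdot f = f$, because $\tau_v \max(0, x / \tau_v) = \max(0, x)$. As a consequence, $(\tau \cdot W, f)$ and $(W, f)$ are isomorphic neural networks. In particular, they have the same network function, $\Psi(\tau \cdot W, f) = \Psi(W, f)$.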
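Positive scale invariance on a per-neuron basis can be illustrated with a few lines of Python. The sketch assumes a small ReLU network with one hidden layer and real weights; the names `psi` and `rescale_neuron` and the numbers are hypothetical. It also shows that a negative factor, which would require changing the activation function as in Consequence 1, does not preserve the function when ReLU is kept.

```python
def relu(z):
    return max(0.0, z)


def psi(W1, W2, x):
    """Network function of a one-hidden-layer ReLU network without output activation."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def rescale_neuron(W1, W2, j, r):
    """Multiply the weights pointing to hidden neuron j by r and divide the weights
    starting on it by r."""
    V1 = [[w * r for w in row] if i == j else list(row) for i, row in enumerate(W1)]
    V2 = [[w / r if i == j else w for i, w in enumerate(row)] for row in W2]
    return V1, V2


W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
x = [1.5, 0.25]

original = psi(W1, W2, x)
positive = psi(*rescale_neuron(W1, W2, 0, 4.0), x)   # r > 0: same function
negative = psi(*rescale_neuron(W1, W2, 0, -1.0), x)  # r < 0 with ReLU kept: differs
```

Here the hidden pre-activations are 1.0 and 1.5, so the original score is 2 · 1.0 − 1.5 = 0.5; rescaling the first neuron by 4 gives 0.5 · 4.0 − 1.5 = 0.5 again, while the factor −1 yields −1.5.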

Architecture
In this section, we first outline the different types of architectures that we consider. We also show how the commonly used layers for neural networks translate into quiver representations. Finally, we will present in detail how an isomorphism of neural networks can be chosen so that the structure of the weights gets preserved.

Types of architectures
Definition 5.1. (Goodfellow et al., 2016, page 193) The architecture of a neural network refers to its structure which accounts for how many units (neurons) it has and how these units are connected together.
For our purposes, we distinguish three types of architectures: combinatorial architecture, weight architecture and activation architecture.
Definition 5.2. The combinatorial architecture of a neural network is given by its network quiver. The weight architecture is given by constraints on how the weights are chosen. The activation architecture is the choice of activation function on each loop of the network quiver.
If we consider the neural network of Fig. 3, the combinatorial architecture deals with how the vertices are connected together, the weight architecture with how the weights W are assigned and the activation architecture with the activation functions f v .
Two neural networks may have different combinatorial, weight and activation architecture like ResNet (He et al., 2016) vs VGGnet (Simonyan and Zisserman, 2015) for example. Neural network layers may also have the same combinatorial architecture but a different activation and weight architecture. It is the case for example of a mean pooling layer vs a convolution layer. While they both encode a convolution (same combinatorial architecture) they have a different activation architecture (as opposed to conv layers, mean pooling has no activation function) and a different weight architecture as the mean pooling weights are fixed and not learnable. Overall, two neural networks have globally the same architecture if and only if they share the same combinatorial, weight, and activation architectures.
Also, isomorphic neural networks always have the same combinatorial architecture, since isomorphisms of neural networks are defined over the same network quiver. However, an isomorphism of neural networks can change the weight and the activation architecture. We will come back to that concept at the end of this section.

Neural network layers
Here, we look at how fully-connected layers, convolutional layers, pooling layers, batch normalization layers and residual connections are related to the quiver representation language.
Let V j be the set of vertices on the j-th hidden layer of Q. A fully connected layer is a hidden layer V j where all vertices on the previous layer are connected to all vertices in V j . A fully connected layer with bias is a hidden layer V j that puts constraints on the previous layer V j−1 . First, the non-bias vertices of V j−1 are fully connected with the non-bias vertices of layer V j . Secondly, there is a bias vertex in V j−1 connected to every vertex of V j . A fully connected layer has no constraints on its weight and activation architecture but imposes that the bias vertex has no activation function and is not connected with the vertices of the previous layer. The reader can find an illustration of this in Fig. 4.
A convolutional layer is a hidden layer V j whose vertices are separated in channels (or feature maps). The weights are typically organized in filters (F n ) m n=1 , and each F n is a tensor also partitioned into m n channels. The constraints on the layer V j−1 is that it should be partitioned into m n channels of the same cardinality. Each filter F n produces a channel on the layer V j by a convolution of V j−1 with the filter F n . Also, a convolution operation has a stride and may use padding.
As opposed to the fully connected layer, a convolutional layer has some constraints on its combinatorial and weight architecture. First, each vertex of V j is connected to a sub-set of vertices in the previous layer "in front" of which it is located. The combinatorial architecture of a conv layer for one feature map is illustrated in Fig. 5(a). Second, the weight architecture imposes weight sharing: the weights of the edges on a conv layer must be shared across all filters as in Fig. 5(b). A conv layer with bias is a hidden layer V j partitioned into channels, where each channel is obtained by convolution of V j−1 with each filter F n , n = 1, ..., m, plus one bias vertex in layer V j−1 that is connected to every vertex on every channel of V j . The weights of the edges starting on the bias vertex should repeat within the same channel. Again, bias vertices do not have an activation function and are not connected to neurons of the previous layer.
The combinatorial architecture of a pooling layer is the same as that of a conv layer, see Fig. 5(a). However, since the purpose of that operation is usually to reduce the size of the previous layer, it contains non-trainable parameters. Thus, pooling layers have a different weight architecture than conv layers. Average pooling fixes the weights in a layer to 1/n where n is the size of the feature map, while max-pooling forces the choice of only one non-zero weight on each window for each input vector. Also, the activation function of an average or max-pooling layer is the identity function. This can be appreciated in Fig. 5(c) and (d).
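The difference between the two weight architectures can be sketched in Python as follows. The helper names are hypothetical, and n is taken here to be the number of entries in one pooling window, which is an assumption of this sketch. Average pooling uses the same weights for every input, whereas the one-hot weights of max-pooling can only be written down once the window values are known.

```python
def avg_pool_weights(n):
    """Fixed weights 1/n, independent of the input."""
    return [1.0 / n] * n


def max_pool_weights(window):
    """One-hot weights selecting the first maximal entry; they depend on the input."""
    k = max(range(len(window)), key=lambda i: window[i])
    return [1.0 if i == k else 0.0 for i in range(len(window))]


def pool(weights, window):
    """A pooling unit is a weighted sum followed by the identity activation."""
    return sum(w * v for w, v in zip(weights, window))


window_1 = [1.0, 4.0, 2.0, 1.0]
window_2 = [5.0, 0.0, 1.0, 2.0]

avg_1 = pool(avg_pool_weights(4), window_1)
max_1 = pool(max_pool_weights(window_1), window_1)
weights_1 = max_pool_weights(window_1)
weights_2 = max_pool_weights(window_2)
```

The two windows lead to different max-pooling weights, which is why such a layer cannot be described by a single fixed thin representation.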
Remark 5.3. Max-pooling layers are compatible with our constructions, but they force us to consider neural networks (W, f ) where the weights on the max-pooling layer are not fixed until the network is fed with an input. This complicates the algebraic methods needed to model it as we need to consider sets of neural networks (one for each possibility of weights in the max-pooling layer) instead of just one.
It is known that max-pooling layers give a small amount of translation invariance at each level since the precise location of the most active feature detector is thrown away, and this produces doubts about the use of max-pooling layers (see Hinton, 2014;Sabour et al., 2017). An alternative to this is the use of attention-based pooling (Kosiorek et al., 2019), which is a global-average pooling. Our interpretation provides a framework that supports why these doubts about the use of max-pooling layers exist: they break the algebraic structure of a neural network. However, average pooling layers, and therefore global-average pooling layers, are perfectly consistent with our results since they are given by fixed weights for any input vector.
Batch normalization layers (Ioffe and Szegedy, 2015) require specifications on the three types of architecture. Their combinatorial architecture is given by two identical consecutive hidden layers where each neuron on the first is connected to only one neuron on the second, and there is one bias vertex in each layer. The weight architecture is given by the batch norm operation, which is $x \mapsto \frac{x - \mu}{\sqrt{\sigma^2}}\,\gamma + \beta$ where $\mu$ is the mean of a batch and $\sigma^2$ its variance, and $\gamma$ and $\beta$ are learnable parameters. The activation architecture is given by two identity activations. This can be seen in Fig. 6.
Remark 5.4. Like max-pooling, the weights µ and σ cannot be determined until the network is fed with a batch of data. But that is true only at training time. Once training is over, µ and σ are set to the overall mean and variance computed across the training data set and thus become normal weights.
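As a rough numerical check of this description, the sketch below writes batch normalization with fixed statistics as two consecutive layers with identity activations and one bias vertex each. The particular split of the weights (1/σ and −µ/σ on the first layer, γ and β on the second) is an assumption made for illustration, the function names are hypothetical, and the small constant usually added to the variance for numerical stability is omitted.

```python
import numpy as np

def batch_norm(x, mu, var, gamma, beta):
    """Batch norm with fixed statistics (no epsilon term)."""
    return (x - mu) / np.sqrt(var) * gamma + beta

def two_layer_form(x, mu, var, gamma, beta):
    """Same map written as two layers with identity activations."""
    sigma = np.sqrt(var)
    # First layer: edge weight 1/sigma, bias-edge weight -mu/sigma.
    hidden = (1.0 / sigma) * x + (-mu / sigma) * 1.0
    # Second layer: edge weight gamma, bias-edge weight beta.
    return gamma * hidden + beta * 1.0

# Hypothetical values for one neuron evaluated on three inputs.
x = np.array([0.5, -1.0, 2.0])
mu, var, gamma, beta = 0.3, 4.0, 1.5, -0.2

bn_out = batch_norm(x, mu, var, gamma, beta)
two_out = two_layer_form(x, mu, var, gamma, beta)
```

Both outputs agree, so once µ and σ are fixed the layer behaves like ordinary weights with identity activations.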
The combinatorial architecture of a residual connection (He et al., 2016) requires the existence of edges in Q that jump over one or more layers. Their weight architecture forces the weights chosen for those edges to be always equal to 1. We refer to Fig. 7 for an illustration of the architecture of a residual connection.

Architecture preserved by isomorphisms
Two isomorphic neural networks can have different weight architectures. Let us illustrate this with a residual connection. Let Q be the following network quiver

[Diagram: network quiver with vertices a, b, c, d, e and arrows α, β, γ, δ]

and let (W, f) be the neural network over Q given by

[Diagram: the neural network (W, f) over Q]

Let $\tau_b \neq \tau_d$ be non-zero numbers. We define a change of basis of the neural network (W, f) by $\tau = (1, \tau_b, 1, \tau_d, 1)$. After applying the action of the change of basis τ · (W, f) we obtain an isomorphic neural network given by

[Diagram: the neural network τ · (W, f)]

The neural networks (W, f) and τ · (W, f) are isomorphic and therefore they have the same network function by Theorem 4.9. However, the neural network (W, f) has a residual connection, while τ · (W, f) does not since the weight on the skip connection is not equal to 1. Nevertheless, if we take $\tau_b = \tau_d$, then the change of basis $\tau = (1, \tau_b, 1, \tau_b, 1)$ will produce an isomorphic neural network with a residual connection, and therefore both neural networks (W, f) and τ · (W, f) will have the same weight architecture. The same phenomenon as for residual connections happens for convolutions, where one has to choose a specific kind of isomorphism to preserve the weight architecture, as shown in Fig. 8.

Isomorphisms of neural networks preserve the combinatorial architecture but not necessarily the weight architecture or the activation architecture.
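The effect on the skip connection can be written out in one line. This is a sketch under two assumptions that are not reproduced in this section: that the change of basis acts on weights by $(\tau \cdot W)_\epsilon = W_\epsilon\, \tau_{t(\epsilon)} / \tau_{s(\epsilon)}$, and that the skip connection is an edge from b to d carrying weight 1.

```latex
(\tau \cdot W)_{\mathrm{skip}}
  = W_{\mathrm{skip}} \, \frac{\tau_d}{\tau_b}
  = \frac{\tau_d}{\tau_b},
\qquad
\frac{\tau_d}{\tau_b} = 1 \iff \tau_b = \tau_d .
```

Under these assumptions the skip weight stays equal to 1 exactly when the two scalars coincide, which matches the two cases described above.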

Consequences
As in the previous section, expressing neural network layers as quiver representations has some consequences. Let us mention two.
Consequence 1 The first consequence derives from the isomorphism of residual layers. It is claimed by  that there is no positive scale invariance across residual blocks. However, we can see that the quiver representation language allows us to prove that in fact there is positive scale invariance across residual blocks for ReLU networks.

Figure 8: An illustration of a convolutional layer. The black arrows with target g, h, i and j correspond to the first channel, and the gray arrows with target k, l, m and n correspond to the second channel. A change of basis τ ∈ G that preserves the weight architecture of this convolutional layer has to be of the form $\tau = (\tau_i)_{i=a}^{m}$ where $\tau_g = \tau_h = \tau_i = \tau_j$ and $\tau_k = \tau_l = \tau_m = \tau_n$.

Consequence 2
The second consequence is related to the isomorphism of convolutional layers. As in Fig. 8, a change of basis τ ∈ G that preserves the weight architecture of this convolutional layer has to be of the form $\tau = (\tau_i)_{i=a}^{m}$ where $\tau_g = \tau_h = \tau_i = \tau_j$ and $\tau_k = \tau_l = \tau_m = \tau_n$. This is what  do for the particular case of ReLU networks and positive change of basis. While positive scale invariance of ReLU networks is a special kind of isomorphism of neural networks that preserves both the weight and the activation architecture, we may generalize this notion as follows.
Definition 5.5. Let (W, f ) be a neural network and let τ ∈ G be an element of the group of change of basis of neural networks such that the isomorphic neural network τ · (W, f ) has the same weight architecture as (W, f ). The teleportation of the neural network (W, f ) with respect to τ is the neural network τ · (W, f ).
Teleportation produces a neural network with the same combinatorial architecture, weight architecture and network function, while it may change the activation architecture. For example, consider a neural network with ReLU activations and a real change of basis. Since ReLU is positive scale invariant, any positive change of basis will leave ReLU invariant. On the other hand, for a negative change of basis the activation changes to min(0, x) and therefore the weight optimization landscape also changes. This implies that teleportation may change the optimization problem while preserving the network function, and the network gets 'teleported' either to another place in the same loss landscape or to a completely different loss landscape.
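The ReLU example can be checked numerically on a toy network. The sketch below assumes that the change of basis multiplies the weights entering a hidden vertex v by τ_v, divides the weights leaving it by τ_v, and replaces the activation f by z ↦ τ_v f(z/τ_v); this is the form consistent with the action on framings given later in the text. The one-hidden-layer network, its sizes and the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(W1, W2, x, act):
    """One hidden layer, no bias, no activation on the output layer."""
    return W2 @ act(W1 @ x)

def teleport(W1, W2, tau):
    """Assumed action: incoming weights of hidden vertex v are multiplied
    by tau_v and outgoing weights are divided by tau_v."""
    return tau[:, None] * W1, W2 / tau[None, :]

def teleported_act(act, tau):
    """Assumed action on activations: z -> tau * act(z / tau)."""
    return lambda z: tau * act(z / tau)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=2)
tau = np.array([2.0, -0.5, 3.0])  # one negative entry

W1_t, W2_t = teleport(W1, W2, tau)
out_original = forward(W1, W2, x, relu)
out_teleported = forward(W1_t, W2_t, x, teleported_act(relu, tau))

# Effect of a positive and of a negative scalar on ReLU itself.
z = np.linspace(-2.0, 2.0, 9)
pos = teleported_act(relu, np.array(2.0))(z)
neg = teleported_act(relu, np.array(-0.5))(z)
```

The two outputs coincide, `pos` equals ReLU on the grid, and `neg` equals min(0, z): the network function is preserved while the activation of the second hidden neuron changes.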

Data Representations
In machine learning, a data sample is usually represented by a vector, a matrix or a tensor containing a series of observed variables. However, one may view data from a different perspective, namely the neuron outputs obtained after a forward pass, also known as "feature maps" for conv nets (Goodfellow et al., 2016). This has been done in the past to visualize what neurons have learned (Yosinski et al., 2015).
In this section, we propose a mathematical description of the data in terms of the architecture of the neural network, i.e., the neuron values obtained after a forward pass. We shall prove that doing so allows us to represent data by a quiver representation. Our approach is different from representation learning (Goodfellow et al., 2016, page 4) because we do not focus on how the representations are learned but rather on how the representations of the data are encoded by the forward pass of the neural network.
Definition 6.1. A labeled data set is given by a finite set D = {(x i , t i )} n i=1 of pairs such that x i ∈ C d is a data vector (could also be a matrix or a tensor) and t i is a target. We can have t i ∈ C k for a regression and t i ∈ {C 0 , C 1 , ..., C k } k for a classification.
Let (W, f) be a neural network over a network quiver Q and let (x, t) be a sample of a data set D. When the network processes the input x, the vector x percolates through the edges and the vertices from the input to the output of the network. As mentioned before, this results in neuron values (or feature maps) that one can visualize (Yosinski et al., 2015). On their own, the neuron values are not a quiver representation per se. However, one can combine these neuron values with their pre-activations and the network weights to obtain a thin representation with identity activation. Since that representation derives from the forward pass of x, it is specific to it.
Remark 6.2. Every thin quiver representation V of the delooped quiver $Q^\bullet$ defines a neural network over the network quiver Q with identity activations, that we denote (V, 1). We do not claim that taking identity activation functions for a neural network will result in something good in usual deep learning practices. This is only a theoretical trick to manipulate the underlying algebraic objects we have constructed.
Our data representation for x is a thin representation that we call $W^f_x$ with identity activations whose function when fed with an input vector of ones $1_d := (1, ..., 1) \in \mathbb{C}^d$ satisfies

$$\Psi(W^f_x, 1)(1_d) = \Psi(W, f)(x),$$

where Ψ(W, f)(x) is the score of the network (W, f) after a forward pass of x.
Recovering $W^f_x$ given the forward pass of x through (W, f) is illustrated in Fig. 9 (a) and (b). Let us keep track of the computations of the network in the thin quiver representation $W^f_x$ and remember that at the end, we want the output of the neural network $(W^f_x, 1)$ when fed with an input vector $1_d \in \mathbb{C}^d$ to be equal to Ψ(W, f)(x).
If $\epsilon \in E$ is an oriented edge such that $s(\epsilon) \in V$ is a bias vertex, then the computations of the weight corresponding to $\epsilon$ get encoded as $(W^f_x)_\epsilon = W_\epsilon$. If on the other hand $s(\epsilon) \in V$ is an input vertex, then the computations of the weights on the first layer get encoded as $(W^f_x)_\epsilon = W_\epsilon\, x_{s(\epsilon)}$, see Fig. 9(b).
On the second and subsequent layers of the network (W, f) we encounter activation functions. Also, the weight corresponding to an oriented edge $\epsilon$ in $W^f_x$ will have to cancel the unnecessary computations coming from the previous layer. That is, $(W^f_x)_\epsilon$ has to be equal to $W_\epsilon$ times the activation output of the vertex $s(\epsilon)$ divided by the pre-activation of $s(\epsilon)$.
Overall, $W^f_x$ is defined as

$$
(W^f_x)_\epsilon =
\begin{cases}
W_\epsilon\, x_{s(\epsilon)} & \text{if } s(\epsilon) \text{ is an input vertex},\\[4pt]
W_\epsilon & \text{if } s(\epsilon) \text{ is a bias vertex},\\[4pt]
W_\epsilon\, \dfrac{a(W, f)_{s(\epsilon)}(x)}{\sum_{\beta \in \zeta_{s(\epsilon)}} W_\beta\, a(W, f)_{s(\beta)}(x)} & \text{if } s(\epsilon) \text{ is a hidden vertex},
\end{cases}
\tag{5}
$$

where $\zeta_{s(\epsilon)}$ is the set of arrows of Q with target $s(\epsilon)$. In the case where the activation function is ReLU, for an oriented edge $\epsilon$ such that $s(\epsilon)$ is a hidden vertex, either $(W^f_x)_\epsilon = 0$ or $(W^f_x)_\epsilon = W_\epsilon$.

Figure 9: The induced thin quiver representation $W^f_x$ considered as a neural network $(W^f_x, 1)$ and obtained after feed-forwarding x through (W, f). It can be seen that feed-forwarding a vector of ones 1 through $W^f_x$ (i.e. $\Psi(W^f_x, 1)(1)$) gives the same output as feed-forwarding x through (W, f): $\Psi(W^f_x, 1)(1) = \Psi(W, f)(x)$. We refer to Theorem 6.4 for the general case. (c) In the case $W_\alpha x = 0$, we can add 1 to the corresponding pre-activation in $W^f_x$ to prevent a division by zero, while on the next layer we consider $W_\alpha x + 1$ as the pre-activation.

Remark 6.3. Observe that the denominator is the pre-activation of vertex $s(\epsilon)$ and can be equal to zero. However, the set where this happens is of measure zero. And even in the case that it turns out to be exactly zero, one can add a number $\eta \neq 0$ (for example $\eta = 1$) to make it non-zero and then consider $\eta$ as the pre-activation of that corresponding neuron, see Fig. 9(c). So we will assume, without loss of generality, that pre-activations of neurons are always non-zero.
The quiver representation W f x of the delooped quiver Q • accounts for the combinatorics of the history of all the computations that the neural network (W, f ) performs on a forward pass given the input x. The main property of the quiver representation W f x is given by the following result. A small example of the computation of W f x and a view into how the next Theorem works, can be found in Appendix B.
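Theorem 6.4. Let (W, f) be a neural network over Q, let (x, t) be a data sample for (W, f) and consider the induced thin quiver representation $W^f_x$ of $Q^\bullet$. The network function of the neural network $(W^f_x, 1)$ satisfies

$$\Psi(W^f_x, 1)(1_d) = \Psi(W, f)(x).$$

Proof. Obviously, both neural networks have different input vectors, that is, $1_d$ for $(W^f_x, 1)$ and x for (W, f). If v ∈ V is a source vertex, by definition $a(W^f_x, 1)_v(1_d) = 1$. We will show that in the other layers, the activation output of a vertex in $(W^f_x, 1)$ is equal to the pre-activation of (W, f) in that same vertex. Assume that v ∈ V is in the first hidden layer, let $\zeta^{\mathrm{bias}}_v$ be the set of oriented edges of Q with target v and source vertex a bias vertex, and let $\zeta^{\mathrm{input}}_v$ be the set of oriented edges of Q with target v and source vertex an input vertex. Then

$$a(W^f_x, 1)_v(1_d) = \sum_{\epsilon \in \zeta^{\mathrm{bias}}_v} (W^f_x)_\epsilon + \sum_{\epsilon \in \zeta^{\mathrm{input}}_v} (W^f_x)_\epsilon = \sum_{\epsilon \in \zeta^{\mathrm{bias}}_v} W_\epsilon + \sum_{\epsilon \in \zeta^{\mathrm{input}}_v} W_\epsilon\, x_{s(\epsilon)},$$

which is the pre-activation of vertex v in (W, f). Now let v be a vertex in a subsequent layer. If, for every edge $\epsilon$ with target v, the value $a(W^f_x, 1)_{s(\epsilon)}(1_d)$ is the pre-activation of vertex $s(\epsilon)$ in (W, f), by the above formula we get that

$$(W^f_x)_\epsilon\, a(W^f_x, 1)_{s(\epsilon)}(1_d) = W_\epsilon\, a(W, f)_{s(\epsilon)}(x),$$

and summing over all edges with target v gives $\sum_{\epsilon \in \zeta_v} W_\epsilon\, a(W, f)_{s(\epsilon)}(x)$, which is the pre-activation of vertex v in (W, f) when fed with the input vector x. That is, the activation output of v in $(W^f_x, 1)$ when fed with $1_d$ equals the pre-activation of v in (W, f) when fed with x. An induction argument gives the desired result since the output layer has no activation, and the coordinates of $\Psi(W^f_x, 1)(1_d)$ and Ψ(W, f)(x) are the values of the output vertices.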
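The statement of Theorem 6.4 can be illustrated on a toy network. The following sketch is not the example of Appendix B; it assumes a small fully connected ReLU network without bias vertices, random weights (so that pre-activations are non-zero), and hypothetical function names. It builds the weights of the data representation as described above (first-layer weights times the input coordinate, later weights times activation output over pre-activation) and feeds a vector of ones through them with identity activations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(weights, x, act):
    """Return the output and the (pre-activation, activation) pair of
    every hidden layer; the output layer has no activation."""
    a, trace = x, []
    for W in weights[:-1]:
        p = W @ a
        a = act(p)
        trace.append((p, a))
    return weights[-1] @ a, trace

def data_representation(weights, x, act):
    """Weights of the thin representation induced by x (no bias vertices)."""
    _, trace = forward(weights, x, act)
    rep = [weights[0] * x[None, :]]        # edges starting at input vertices
    for W, (p, a) in zip(weights[1:], trace):
        rep.append(W * (a / p)[None, :])   # edges starting at hidden vertices
    return rep

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 2)),
           rng.normal(size=(2, 3)),
           rng.normal(size=(1, 2))]
x = rng.normal(size=2)

out_net, _ = forward(weights, x, relu)
rep = data_representation(weights, x, relu)
out_rep, _ = forward(rep, np.ones(2), lambda z: z)  # identity activations
```

With ReLU, every entry of `rep[1]` and `rep[2]` is either zero or the original weight, and `out_rep` agrees with `out_net`.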
Corollary 6.5. Let (x, t) and (x', t') be data samples for (W, f). If the quiver representations $W^f_x$ and $W^f_{x'}$ are isomorphic via G then Ψ(W, f)(x) = Ψ(W, f)(x').
Proof. The neural networks $(W^f_x, 1)$ and $(W^f_{x'}, 1)$ are isomorphic if and only if the quiver representations $W^f_x$ and $W^f_{x'}$ are isomorphic via G. By the last Theorem and the fact that isomorphic neural networks have the same network function (Theorem 4.9) we get that

$$\Psi(W, f)(x) = \Psi(W^f_x, 1)(1_d) = \Psi(W^f_{x'}, 1)(1_d) = \Psi(W, f)(x').$$

Consequences
Interpreting data as quiver representations has several consequences.

Consequence 1
The combinatorial architecture of (W, f) and of $(W^f_x, 1)$ are equal, and the weight architecture of $(W^f_x, 1)$ is determined by both the weight and activation architectures of the neural network (W, f) when it is fed the input vector x.
Consequence 2 By Corollary 6.5, we obtain that the neural network is representing the data as the isomorphism classes $[W^f_x]$ of the thin quiver representations $W^f_x$ under the action of the change of basis group G of neural networks. Said otherwise, if a data sample x is represented by a thin quiver representation $W^f_x$, one can generate an infinite number of new data representations isomorphic to $W^f_x$ via G, which all have the same network output. Doing so could have important implications in the field of adversarial attacks and network fooling (Akhtar and Mian, 2018) where one could generate fake data at will which, when fed to a network, all have exactly the same output as the original data x.
Consequence 3 It is well known that feature maps can be visualized into images showing the input signal characteristics and thus providing intuitions on the behavior of the network and its impact on an image (Yosinski et al., 2015;Feghahati et al., 2019). This notion is strongly supported by our findings as our thin quiver representations of data include both the network structure and the data (see Eq. (5) in page 21 and the proof of Theorem 6.4). As such, using a thin quiver representation opens the door to a formal (and less intuitive) way to understand the interaction between data and the structure of a network.

The Moduli Space of a Neural Network
In this section, we propose a modified version of the manifold hypothesis of Goodfellow et al. (2016, section 5.11.3). The original manifold hypothesis claims that the data lies in a low-dimensional manifold inside the input space. Although this hypothesis seems intuitive and the empirical findings support its formulation, the existence and the structure of such a manifold are mathematically vague. Our manifold hypothesis will put more geometric structure on the manifold formed by algebraic objects, and we will give an explicit map whose image generates a manifold containing the data quiver representations of a data set. In order to formalize our manifold hypothesis, we will attach an explicit geometrical object to every neural network (W, f) over a network quiver Q, that will contain the isomorphism classes $[W^f_x]$ of the data quiver representations induced by any kind of data set D. This geometrical object, which we denote ${}_d\mathcal{M}_k(\widetilde{Q})$, is called the moduli space. The moduli space only depends on the combinatorial architecture of the neural network, while the activation and weight architectures of the neural network determine how the isomorphism classes $[W^f_x]$ of the data quiver representations are distributed inside the moduli space.

The mathematical objects required to formalize our manifold hypothesis are known as framed quiver representations. We will follow Reineke (2008) for the construction of framed quiver representations in our particular case of thin representations. Recall that the hidden quiver $\widetilde{Q} = (\widetilde{V}, \widetilde{E}, s, t)$ of a network quiver Q is the sub-quiver of the delooped quiver $Q^\bullet$ formed by the hidden vertices $\widetilde{V}$ and the arrows $\widetilde{E}$ between hidden vertices. Every thin representation of the delooped quiver $Q^\bullet$ induces a thin representation of the hidden quiver $\widetilde{Q}$ by forgetting the arrows whose source is an input (or bias) vertex, or whose target is an output vertex.
In this section we call input vertices of $\widetilde{Q}$ the vertices of $\widetilde{Q}$ that are connected to the input vertices of Q, and we call output vertices of $\widetilde{Q}$ the vertices that are connected to the output vertices of Q. Let $d_1$ be the number of input vertices of $\widetilde{Q}$ and let $d_L$ be the number of output vertices of $\widetilde{Q}$. That is, there are $d_1$ neurons in the first hidden layer, and $d_L$ neurons in the last hidden layer.
Remark 7.1. For the sake of simplicity, we will assume that there are no bias vertices in the quiver Q. If there are bias vertices in Q, we can consider them as part of the input layer in such a way that every input vector x ∈ C d needs to be extended to a vector x ∈ C d+b with its last b coordinates all equal to 1, where b is the number of bias vertices. All the quiver representation theoretic arguments made in this section are therefore valid also for neural networks with bias vertices under these considerations. This also has to do with the fact that the group of change of basis of neural networks G has no factor corresponding to bias vertices, as the hidden quiver is obtained by removing all source vertices, not only input vertices.
Let $\widetilde{W}$ be a thin representation of the hidden quiver $\widetilde{Q}$. We fix once and for all a family of vector spaces $\{V_v\}_{v \in \widetilde{V}}$ indexed by the vertices of $\widetilde{Q}$, given by $V_v = \mathbb{C}^k$ when v is an output vertex of $\widetilde{Q}$ and $V_v = 0$ for any other $v \in \widetilde{V}$. The choice of linear maps $h_v : \widetilde{W}_v \to V_v$ for every $v \in \widetilde{V}$ determines what is known as a framed quiver representation of $\widetilde{Q}$ by the family of vector spaces $\{V_v\}_{v \in \widetilde{V}}$ (Reineke, 2008). We can see that $h_v$ is equal to the zero map when v is not an output vertex of $\widetilde{Q}$, and therefore $h = (h_v)_{v \in \widetilde{V}}$ can be represented by a matrix from the output layer of $\widetilde{W}$ to $\mathbb{C}^k$, so that we can omit the vector spaces $V_v$ from the notation of h. Observe that $h_v : \mathbb{C} \to \mathbb{C}^k$ for v an output vertex of $\widetilde{Q}$, and if h is non-zero then there is at least one $h_v$ that is non-zero for v an output vertex of $\widetilde{Q}$.
Dually, we can fix a family of vector spaces $\{U_v\}_{v \in \widetilde{V}}$ indexed by $\widetilde{V}$ and given by $U_v = \mathbb{C}^d$ when v is an input vertex of $\widetilde{Q}$ and $U_v = 0$ for any other $v \in \widetilde{V}$. The choice of linear maps $\ell_v : U_v \to \widetilde{W}_v$ for every $v \in \widetilde{V}$ determines a co-framed quiver representation of $\widetilde{Q}$ by the family of vector spaces $\{U_v\}_{v \in \widetilde{V}}$. We can see that $\ell_v$ is the zero map when v is not an input vertex of $\widetilde{Q}$, and therefore $\ell = (\ell_v)_{v \in \widetilde{V}}$ can be represented by a matrix from $\mathbb{C}^d$ to the input layer of $\widetilde{W}$, so that we can omit the vector spaces $U_v$ from the notation of $\ell$. Observe that $\ell_v : \mathbb{C}^d \to \mathbb{C}$ for v an input vertex of $\widetilde{Q}$, and if $\ell$ is non-zero then there is at least one $\ell_v$ that is non-zero for $v \in \widetilde{V}$ an input vertex of $\widetilde{Q}$. If we were to consider bias vertices in Q, the weights corresponding to the bias vertices would form part of $\ell$, and $\ell$ would no longer be represented by a matrix, but of course the quiver representation arguments would still hold.
We will consider both h and $\ell$ as matrices with coordinates $h_{i,j}$ and $\ell_{i,j}$ respectively.

Figure 10: An illustration of a double-framed thin quiver representation $(\ell, \widetilde{W}, h)$. The boxes define the vector spaces of the framing and co-framing, given by $\mathbb{C}^d$ and $\mathbb{C}^k$. Each input vertex of the hidden quiver $\widetilde{Q}$ is connected to $\mathbb{C}^d$ and each output vertex of $\widetilde{Q}$ is connected to $\mathbb{C}^k$.

Definition 7.2. A double-framed thin quiver representation is a triple $(\ell, \widetilde{W}, h)$ where $\widetilde{W}$ is a thin quiver representation of the hidden quiver $\widetilde{Q}$ and $\ell : \mathbb{C}^d \to \mathbb{C}^{d_1}$ and $h : \mathbb{C}^{d_L} \to \mathbb{C}^k$ are non-zero linear maps (see Fig. 10). We let ${}_d\mathcal{R}_k(\widetilde{Q})$ be the set of double-framed thin quiver representations of $\widetilde{Q}$.
The group of change of basis of double-framed thin quiver representations is the same group G of change of basis of neural networks. The action of G on double-framed quiver representations for τ ∈ G is given by

$$\tau \cdot (\ell, \widetilde{W}, h) = (\tau \cdot \ell,\ \tau \cdot \widetilde{W},\ \tau \cdot h),$$

where the matrices $\tau \cdot h$ and $\tau \cdot \ell$ are given by $(\tau \cdot h)_{i,j} = h_{i,j} / \tau_i$ and $(\tau \cdot \ell)_{i,j} = \ell_{i,j}\, \tau_j$. Every double-framed thin quiver representation of the hidden quiver $\widetilde{Q}$ isomorphic to $(\ell, \widetilde{W}, h)$ is of the form $\tau \cdot (\ell, \widetilde{W}, h)$ for some τ ∈ G. In the following theorem, we show that instead of studying the isomorphism classes $[W^f_x]$ of the thin quiver representations of the delooped quiver $Q^\bullet$ induced by the data, we can study the isomorphism classes of double-framed thin quiver representations of the hidden quiver.

Theorem 7.3. Let Q be a network quiver. There is a bijective correspondence between the isomorphism classes of thin representations of the delooped quiver $Q^\bullet$ and the isomorphism classes of double-framed thin quiver representations of the hidden quiver $\widetilde{Q}$.

Proof. The correspondence between isomorphism classes is due to the equality of the group of change of basis for neural networks and double-framed thin quiver representations, since the isomorphism classes are given by the action of the same group. Given a thin representation W of the delooped quiver, it induces a thin representation $\widetilde{W}$ of the hidden quiver $\widetilde{Q}$ by forgetting the input and output layers of Q. Moreover, if we consider the input vertices of Q as the coordinates of $\mathbb{C}^d$ and the output vertices of Q as the coordinates of $\mathbb{C}^k$, then the weights starting on input vertices of Q define the map $\ell$ while the weights ending on output vertices of Q define the map h. This can be seen in Fig. 10. Given a double-framed thin quiver representation $(\ell, \widetilde{W}, h)$, the entries of the matrix $\ell$ (resp. h) are the weights of a thin representation W starting (resp. ending) on input (resp. output) vertices, while $\widetilde{W}$ defines the other weights of W.

From now on we will identify a double-framed thin quiver representation $(\ell, \widetilde{W}, h)$ with the thin representation W of the delooped quiver $Q^\bullet$ defined by $(\ell, \widetilde{W}, h)$ as in the proof of the last theorem.
We will also identify the isomorphism classes

$$[W] = [(\ell, \widetilde{W}, h)],$$

where the symbol on the left means the isomorphism class of the thin representation W under the action of G, and the one on the right is the isomorphism class of the double-framed thin quiver representation $(\ell, \widetilde{W}, h)$.
Before using Nakajima (1996)'s theorem on the existence of the moduli space, and Reineke (2008)'s calculation of the dimension of the moduli space, we will need to consider subrepresentations of a quiver representation.
Definition 7.4. (Schiffler, 2014, page 14) Let W be a thin representation of the delooped quiver Q • of a network quiver Q. A sub-representation of W is a representation U of Q • such that there is a morphism of representations τ : U → W where each map τ v is an injective map.
The zero representation of Q is the representation denoted 0 where every vector space assigned to every vertex is the zero vector space. Note that if U is a quiver representation, then the zero representation 0 is a sub-representation of U since τ v = 0 for all vertices v is an injective map.
We can see from Fig. 11 that the combinatorics of the quiver are related to the existence of sub-representations. For instance, a representation U such that U v = 0 on one single output vertex v cannot be a sub-representation of a thin representation unless U v = 0 for all v. This fact will be used in the proof of the following theorem, together with its dual notion. Namely, the only way a representation U can be a sub-representation of a thin representation with U v = C for at least one input vertex v is when U v = C for all vertices v. For example, in Fig. 11(c) we can see that the top representation is not a sub-representation of the bottom thin representation.
The existence of the moduli space depends on a chosen notion of stability for double-framed quiver representations (Reineke, 2008;Nakajima, 1996). We will introduce this notion and then prove that every double-framed thin quiver representation induced by a neural network and a data sample is stable.
Remark 7.5. Usually, one does either a framing or a co-framing, and chooses a stability condition for each one. In our case, we will do both at the same time, and use the definition of stability given by (Reineke, 2008) for framed representations, together with its dual notion of stability for co-framed representations.
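Given a double-framed thin quiver representation $(\ell, \widetilde{W}, h)$ of the hidden quiver $\widetilde{Q}$, the image of the map $\ell$ lies inside the representation $\widetilde{W}$. The map $\ell$ is given by a family of maps indexed by the vertices of the hidden quiver $\widetilde{Q}$, namely $\ell = \{\ell_v : \mathbb{C}^{n_v} \to \widetilde{W}_v \mid v \in \widetilde{V}\}$. Recall that $n_v = 0$ if v is not an input vertex of the hidden quiver $\widetilde{Q}$, and $n_v = d$ when v is an input vertex of $\widetilde{Q}$. The image of $\ell$ is by definition a family of vector spaces indexed by the hidden quiver $\widetilde{Q}$, given by

$$\operatorname{Im}(\ell) = \big(\operatorname{Im}(\ell)_v\big)_{v \in \widetilde{V}}.$$

By definition $\operatorname{Im}(\ell)_v = \{z \in \widetilde{W}_v = \mathbb{C} \mid \ell_v(w) = z \text{ for some } w \in \mathbb{C}^{n_v}\}$, and therefore $\operatorname{Im}(\ell)_v$ is non-zero for at least one input vertex v of $\widetilde{Q}$. Dually, the kernel of the map h lies inside the representation $\widetilde{W}$. The map h is given by a family of maps indexed by the vertices of the hidden quiver $\widetilde{Q}$, namely $h = \{h_v : \widetilde{W}_v \to \mathbb{C}^{m_v} \mid v \in \widetilde{V}\}$. Recall that $m_v = 0$ if v is not an output vertex of the hidden quiver $\widetilde{Q}$, and $m_v = k$ when v is an output vertex of $\widetilde{Q}$. Therefore, the kernel of h is by definition a family of vector spaces indexed by the hidden quiver $\widetilde{Q}$. That is,

$$\ker(h) = \big(\ker(h)_v\big)_{v \in \widetilde{V}}, \qquad \ker(h)_v = \{z \in \widetilde{W}_v \mid h_v(z) = 0\},$$

and therefore $\ker(h)_v$ is zero for at least one output vertex v of $\widetilde{Q}$.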
Definition 7.6. A double-framed thin quiver representation $(\ell, \widetilde{W}, h)$ is stable if the following two conditions are satisfied:
1. The only sub-representation U of $\widetilde{W}$ which is contained in ker(h) is the zero sub-representation,
2. The only sub-representation U of $\widetilde{W}$ that contains $\operatorname{Im}(\ell)$ is $\widetilde{W}$.
Theorem 7.7. Let (W, f) be a neural network and let (x, t) be a data sample for (W, f). Then the double-framed thin quiver representation $W^f_x$ is stable.

Proof.
We express $W^f_x = (\ell, \widetilde{W}, h)$ as in Theorem 7.3. We will assume, without loss of generality, that there is at least one map $h_v$ that is non-zero, and that there is at least one map $\ell_v$ that is non-zero.
The kernel ker(h) is zero in at least one output vertex of the hidden quiver $\widetilde{Q}$ and, as in Fig. 11, by the combinatorics of quiver representations there is no sub-representation of $\widetilde{W}$ with at least one zero on an output vertex of $\widetilde{Q}$ other than the zero representation. Since the combinatorics of network quivers forces a sub-representation contained in ker(h) to be the zero sub-representation, we obtain the first condition for stability of double-framed thin quiver representations.
Dually, the image $\operatorname{Im}(\ell)_v$ is equal to $\widetilde{W}_v = \mathbb{C}$ for at least one input vertex $v \in \widetilde{V}$. By the combinatorics of quiver representations, as in Fig. 11, there is no sub-representation of $\widetilde{W}$ that contains $\operatorname{Im}(\ell)$ other than $\widetilde{W}$. Therefore the only sub-representation of $\widetilde{W}$ that contains $\operatorname{Im}(\ell)$ is $\widetilde{W}$. Thus, $W^f_x = (\ell, \widetilde{W}, h)$ is a stable double-framed thin quiver representation of the hidden quiver $\widetilde{Q}$.
The moduli space of stable double-framed thin quiver representations of the hidden quiver $\widetilde{Q}$ is by definition

$${}_d\mathcal{M}_k(\widetilde{Q}) := \big\{ [V] : V \in {}_d\mathcal{R}_k(\widetilde{Q}) \text{ is stable} \big\}.$$

Therefore, we have proved that given a neural network (W, f) and a data set D, the isomorphism classes $[W^f_x]$ of the data quiver representations lie inside the moduli space. This fact does not depend on the weights of the network, the activations, the training or the chosen data set. It only depends on the quiver Q, that is, the combinatorial architecture of the network (W, f).
The following result is a particular case of Nakajima (1996)'s theorem, generalized for double-framings and restricted to thin representations, combined with Reineke (2008)'s calculation of framed quiver moduli space dimension adjusted for double-framings (see Appendix C for details about the computation of this dimension).
Theorem 7.8. Let Q be a network quiver. There exists a geometric quotient ${}_d\mathcal{M}_k(\widetilde{Q})$ of stable representations of ${}_d\mathcal{R}_k(\widetilde{Q})$ by the action of the group G, called the moduli space of stable double-framed thin quiver representations of the hidden quiver $\widetilde{Q}$. Moreover, ${}_d\mathcal{M}_k(\widetilde{Q})$ is non-empty and

$$\dim_{\mathbb{C}} {}_d\mathcal{M}_k(\widetilde{Q}) = \#E^{\bullet} - \#\widetilde{V} - 1.$$

In short, the dimension of the moduli space of the hidden quiver $\widetilde{Q}$ equals the number of edges of $Q^\bullet$ minus the number of hidden vertices minus 1.
Remark 7.9. The mathematical existence of the moduli space (Reineke, 2008;Nakajima, 1996) depends on two things:
• the models are built upon the complex numbers C, and
• the change of basis group of neural networks G is the change of basis group of thin quiver representations of Q.
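For a concrete feel of this count, the sketch below evaluates "edges of the delooped quiver minus hidden vertices minus 1" for a fully connected network. It assumes no bias vertices and no skip connections, so that the edges are exactly the weights between consecutive layers; the function name and the layer widths are hypothetical.

```python
def moduli_space_dimension(layer_widths):
    """Dimension count for a fully connected network without bias.

    layer_widths = [d, d_1, ..., d_L, k]: input size, hidden widths, output size.
    """
    # Edges of the delooped quiver: one per weight between consecutive layers.
    edges = sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))
    # Hidden vertices: every vertex that is neither input nor output.
    hidden = sum(layer_widths[1:-1])
    return edges - hidden - 1

# 4 inputs, hidden layers of 5 and 3 neurons, 2 outputs:
# edges = 4*5 + 5*3 + 3*2 = 41, hidden = 8.
dim = moduli_space_dimension([4, 5, 3, 2])
```

For these widths the count is 41 − 8 − 1 = 32, noticeably smaller than the 41 weights of the network.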
One may try to study instead the space whose points are isomorphism classes given by the action of the sub-group H of the change of basis group G, whose action preserves both the weight and the activation architectures. By doing so we obtain a group H that is not the change of basis group of quiver representations, which gets in the way of the construction, and therefore the existence, of the moduli space. This happens even in the case of ReLU activation.

Consequences
Consequence 1 The dimension of the moduli space is the number of parameters of the network minus the number of hidden neurons minus one. The dimension of the moduli space plus 1 is equal to the number of basis paths in ReLU networks found by , where they empirically confirm that the number of basis paths is a good measure for generalization. The dimension of the moduli space plus one was also obtained as the rank of a structure matrix for paths in a ReLU network .

Consequence 2
The moduli space ${}_d\mathcal{M}_k(\widetilde{Q})$ as a set is given by

$${}_d\mathcal{M}_k(\widetilde{Q}) = \big\{ [V] : V \in {}_d\mathcal{R}_k(\widetilde{Q}) \text{ is stable} \big\}.$$

That is, the points of the moduli space are the isomorphism classes of (stable) double-framed thin quiver representations of the hidden quiver $\widetilde{Q}$ over the action of the change of basis group G of neural networks. Therefore, for every data sample (x, t) the neural network (W, f) induces a point

$$[W^f_x] \in {}_d\mathcal{M}_k(\widetilde{Q}).$$

Consequence 3 Let (W, f) be a neural network over Q and let (x, t) be a data sample. If $(W^f_x)_\epsilon = 0$ for an edge $\epsilon$, then any other quiver representation V of the delooped quiver $Q^\bullet$ that is isomorphic to $W^f_x$ also satisfies $V_\epsilon = 0$. Hence, if there are many samples (x, t) such that for a specific edge $\epsilon \in Q^\bullet$ the corresponding linear map on $W^f_x$ is zero, then the coordinates of $[W^f_x]$ inside the moduli space corresponding to $\epsilon$ are not used. Therefore, a projection of those coordinates to zero corresponds to the notion of pruning of neural networks, that is, forcing to zero the smaller weights on a network (Frankle and Carbin, 2019). From Eq. (5) we can see that this interpretation of the data explains why naive pruning works.
Consequence 4 We proceed to formalize our manifold hypothesis of the data in terms of the moduli space. Consider a network quiver Q and a neural network (W, f) over Q. We define a map

$$\phi(W, f) : \mathbb{C}^d \to {}_d\mathcal{M}_k(\widetilde{Q}), \qquad x \mapsto [W^f_x].$$

Given a data set D, its input vectors define a finite family of points $x_1, ..., x_n \in \mathbb{C}^d$. By definition

$$\phi(W, f)(x_i) = [W^f_{x_i}] \quad \text{for } i = 1, ..., n.$$

Then all the isomorphism classes $[W^f_{x_1}], ..., [W^f_{x_n}]$ lie inside the image of the map ϕ(W, f), which is a sub-space of the moduli space. In symbols this becomes

$$\big\{ [W^f_{x_1}], ..., [W^f_{x_n}] \big\} \subset \operatorname{Im} \phi(W, f) \subset {}_d\mathcal{M}_k(\widetilde{Q}).$$

The image of ϕ(W, f) inside the moduli space ${}_d\mathcal{M}_k(\widetilde{Q})$ generates a sub-manifold of ${}_d\mathcal{M}_k(\widetilde{Q})$ that contains Im ϕ(W, f), so the data quiver representations $[W^f_{x_1}], ..., [W^f_{x_n}]$ all lie in this sub-manifold. Therefore, our manifold hypothesis claims the existence of a sub-manifold of ${}_d\mathcal{M}_k(\widetilde{Q})$ that contains the set $\{[W^f_{x_1}], ..., [W^f_{x_n}]\}$ of data quiver representations, and our results prove the existence of such a manifold.
Consequence 5 We can use the moduli space to formulate what training does to the data quiver representations. Training a neural network through gradient descent generates an iterative sequence of neural networks $(W_1, f), (W_2, f), ..., (W_m, f)$ where m is the total number of training iterations. For each gradient descent iteration i = 1, ..., m we have

$$\operatorname{Im} \phi(W_i, f) \subset {}_d\mathcal{M}_k(\widetilde{Q}).$$

The moduli space is given only in terms of the combinatorial architecture of the neural network, while the weight and activation architectures determine how the points $[W^f_{x_1}], ..., [W^f_{x_n}]$ are distributed inside the moduli space ${}_d\mathcal{M}_k(\widetilde{Q})$. Since the training changes the weights and not the network quiver, we obtain that each training step defines a different map

$$\phi(W_i, f) : \mathbb{C}^d \to {}_d\mathcal{M}_k(\widetilde{Q}).$$

A training of a neural network, which is a sequence of neural networks $(W_1, f), ..., (W_m, f)$, can be thought of as first adjusting the manifold $\operatorname{Im} \phi(W_1, f)$ into $\operatorname{Im} \phi(W_2, f)$, then the manifold $\operatorname{Im} \phi(W_2, f)$ into $\operatorname{Im} \phi(W_3, f)$, and so on. Therefore the training becomes an adjustment of the sub-manifold $\operatorname{Im} \phi(W_i, f)$ inside the moduli space ${}_d\mathcal{M}_k(\widetilde{Q})$.
Consequence 6 A training of the form $(W_1, f), \dots, (W_m, f)$ only changes the weights of the neural network. As we can see, our data quiver representations depend on both the weights and the activations, and therefore a usual training does not fully exploit the fact that the data quiver representations are mapped via $\varphi$ to the moduli space. Thus, the idea of learning the activation functions, as done by Goyal et al. (2019), will produce a training of the form $(W_1, f_1), \dots, (W_m, f_m)$, and this allows the maps $\varphi(W_i, f_i)$ to explore more of the moduli space than when only the weights are learned. Therefore, our results open the door to a new way of training the weights and the activation functions of a neural network.

Conclusion and Future Work
We presented the theoretical foundations for a different understanding of neural networks using their combinatorial and algebraic nature, while explaining current intuitions in deep learning by relying only on the mathematical consequences of the computations of the network. We may summarize our work with the following five points:
1. We use quiver representations to represent a neural network.
2. This representation of neural networks scales to modern deep architectures like convolutional layers, pooling layers, residual layers and batch normalization.
3. Theorem 4.9 shows that neural networks are algebraic objects, in the sense that the maps preserving the algebraic structure also preserve the computations of the network. Even more, we show that positive scale invariance of ReLU networks is a particular case of this result.
4. We represented data as thin quiver representations with identity activations in terms of the architecture of the network. We proved that this representation of data is algebraically consistent (invariant under isomorphisms) and carries the important notion of feature spaces, all at the same time.
5. We introduced the moduli space of a neural network, and proved that it contains all possible (isomorphism classes of) thin quiver representations that result from the computations of the neural network on a forward pass. This leads us to the mathematical formalization of a modified version of the manifold hypothesis in machine learning, given in terms of the architecture of the network.
To the knowledge of the authors, the insights, concepts and results in this work are the first of their kind. In the future, we aim to translate more deep learning objects into the quiver representation language. For instance:
• Dropout (Srivastava et al., 2014) is a restriction of the training to several network sub-quivers. This translates into adjustments of the configuration of the data inside the moduli space via sub-spaces given by sub-quivers.
• Generative adversarial networks (Goodfellow et al., 2014) and actor-critics (Silver et al., 2014) provide the stage for the interplay between two moduli spaces that get glued together to form a bigger one. This can be expressed in terms of algebraic diagrams.
• Recurrent neural networks (Hopfield, 1988) become a stack of the same network quiver, and therefore the same moduli space gets glued with copies of itself multiple times.
• The knowledge stored in the moduli space in the form of the map ϕ(W, f ) provides a new concept to express and understand transfer learning (Baxter, 1998). Extending a trained network will globally change the moduli space, while fixing the map ϕ(W, f ) in the sub-space corresponding to the unchanged part of the network quiver.
On expanding the algebraic understanding of neural networks, we consider the following approaches for further research:
• Study the possibility of transferring the gradient descent optimization to the moduli space, with the goal of optimizing not only the network weights but also the activation functions.
• The combinatorics of network quivers seem key to the understanding of neural networks and their moduli spaces, so a further study of network quivers by themselves is required (Assem et al., 2006; Barot, 2015; Schiffler, 2014).
• Continuity and differentiability of the network function Ψ(W, f) and the map ϕ(W, f) will allow the use of more specific algebraic-geometric tools (Hartshorne, 1977; Reineke, 2008). Moreover, the moduli space is a toric variety, so we can use toric geometry (Cox et al., 2011) to study it (see Hille, 1998; Domokos and Jo, 2016).
Finally, this work provides a language in which to state a different kind of scientific hypothesis in deep learning, and we plan to use it as such. Many characterizations will arise from the interplay of algebraic methods and optimization. Namely, when solving a task in practical deep learning one tries different hidden quivers and optimization hyperparameters. Therefore, measuring changes in the hidden quiver will become important.
Appendix A. Example of Theorem 4.9
Here we illustrate with an example the result of Theorem 4.9, i.e., that an isomorphism between two neural networks $(W, f)$ and $(V, g)$ preserves their network function, $\Psi(W, f)(x) = \Psi(V, g)(x)$. Let us consider a ReLU multilayer perceptron $(W, f)$ with 2 hidden layers of 3 neurons each, 2 neurons on the input layer and 2 neurons on the output layer. That is,
We denote by $W_1$, $W_2$ and $W_3$ the weight matrices of the network from left to right. Consider the explicit matrices
Assume now that the input vector is $x = (-1.2, 0.3)^T$. Therefore the score (or output) of the network is,
Consider now a change of basis for $(W, f)$. As mentioned in the text, the changes of basis for the input and output neurons are set to 1 (i.e., $\tau_a = \tau_b = \tau_i = \tau_j = 1$). As for the six hidden neurons, let us consider the following six changes of basis $\tau =$
The weights of the second layer become
and those of the third layer
As for the activations, one has to apply Eq. (2), i.e., $\tau_v \mathrm{ReLU}(x / \tau_v)$ in our case. Note that if $\tau_v > 0$ then $\tau_v \mathrm{ReLU}(x / \tau_v) = \frac{\tau_v}{\tau_v} \mathrm{ReLU}(x) = \mathrm{ReLU}(x)$, which follows from the positive scale invariance of ReLU. However, if $\tau_v < 0$ then $\tau_v \mathrm{ReLU}(x / \tau_v) = \min(x, 0) = g(x)$. Considering the change of basis matrix $\tau$ given before, it follows that $\tau f =$
Let us apply a forward pass on the neural network $(\tau W, \tau f)$ for the same input $x = (-1.2, 0.3)^T$. On the first layer we have:
which is the same output as the one for $(W, f)$ computed before. We can also observe that the activation output of each hidden (and output) neuron on $(\tau W, \tau f)$ is equal to the activation output on that same neuron in $(W, f)$ times the change of basis of that neuron, as noted in the proof of Theorem 4.9.
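The statement of this appendix can also be checked numerically. The sketch below is not the example of the text: the weight matrices and the change of basis are hypothetical values chosen for illustration, the output layer is assumed to be linear (its change of basis is 1 in any case), and the rescaling of an edge weight by the target factor divided by the source factor is the assumed convention for the change of basis. Only the input $x = (-1.2, 0.3)$ and the rule $\tau_v \mathrm{ReLU}(x / \tau_v)$ are taken from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

def transformed(act, tau):
    """Neuron-wise activation (tau f)(z) = tau * f(z / tau)."""
    return lambda z: tau * act(z / tau)

def forward(weights, activations, x):
    """Return the activation outputs of every layer (last entry is the score)."""
    outputs, a = [], x
    for W, act in zip(weights, activations):
        a = act(W @ a)
        outputs.append(a)
    return outputs

# Hypothetical weights of a 2-3-3-2 ReLU MLP, stored as (out, in) matrices.
W1 = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -1.0]])
W2 = np.array([[1.0, 2.0, -1.0], [0.5, -1.0, 2.0], [-2.0, 0.4, 1.0]])
W3 = np.array([[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0]])
x = np.array([-1.2, 0.3])

# Hypothetical change of basis on the six hidden neurons (some negative).
tau1 = np.array([2.0, -0.5, 3.0])
tau2 = np.array([-1.0, 4.0, 0.5])

# Edge weights are multiplied by tau_target / tau_source.
V1 = tau1[:, None] * W1
V2 = tau2[:, None] * W2 / tau1[None, :]
V3 = W3 / tau2[None, :]

out_W = forward([W1, W2, W3], [relu, relu, identity], x)
out_V = forward([V1, V2, V3],
                [transformed(relu, tau1), transformed(relu, tau2), identity], x)

print(out_W[-1], out_V[-1])  # the two scores agree up to rounding
```

In this sketch the hidden activations of the transformed network equal those of the original network multiplied by the corresponding $\tau_v$, while the scores coincide because the output factors are 1.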
Appendix B. Example of Theorem 6.4
Here we compute an example to illustrate that $\Psi(W, f)(x) = \Psi(W^f_x, 1)(1_d)$. We will work with the notation of Appendix A for the ReLU MLP with explicit weight matrices $W_1$, $W_2$ and $W_3$ and input vector $x$ given by
Recall the definition of the representation $W^f_x$,
if $s(\epsilon)$ is a hidden vertex.
Denote by $V_1$, $V_2$ and $V_3$ the weight matrices of $W^f_x$. We can easily see that $V_1$ is given by
Let us compute a forward pass of the network $(V, 1) = (W^f_x, 1)$ given the input $(1, 1)^T$ and verify that the output is the same as that of Appendix A. We have that
As noted in the proof of Theorem 6.4, the activation output of each neuron in $(W^f_x, 1)$ after a forward pass of the input vector $(1, 1)^T$ is equal to the pre-activation of that same neuron in $(W, \mathrm{ReLU})$ after a forward pass of $x$.
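The property stated at the end of this appendix can be reproduced numerically. The sketch below rests on assumptions that are not confirmed by the text above: each weight leaving an input vertex is multiplied by the corresponding input coordinate, and each weight leaving a hidden vertex is multiplied by the ratio of that vertex's activation to its pre-activation. This choice is made only because it yields the stated property; the case of a zero pre-activation is not handled. The weights are hypothetical and the output layer is assumed to be linear.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical weights of a 2-3-3-2 ReLU MLP, stored as (out, in) matrices.
W1 = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -1.0]])
W2 = np.array([[1.0, 2.0, -1.0], [0.5, -1.0, 2.0], [-2.0, 0.4, 1.0]])
W3 = np.array([[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0]])
x = np.array([-1.2, 0.3])

# Forward pass of (W, ReLU), keeping pre-activations z and activations a.
z1 = W1 @ x
a1 = relu(z1)
z2 = W2 @ a1
a2 = relu(z2)
score = W3 @ a2

# Assumed data representation: rescale each column by the quantity attached
# to the source vertex of the edge (requires nonzero pre-activations).
V1 = W1 * x[None, :]
V2 = W2 * (a1 / z1)[None, :]
V3 = W3 * (a2 / z2)[None, :]

# Forward pass of (V, identity) on the all-ones input.
ones = np.ones(2)
h1 = V1 @ ones
h2 = V2 @ h1
score_V = V3 @ h2

print(score, score_V)  # the two scores agree up to rounding
```

With these values, `h1` and `h2` coincide with the pre-activations `z1` and `z2` of the original network, in line with the remark on the proof of Theorem 6.4.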

Appendix C. Double-framed quiver moduli
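In this appendix, we adapt the computation of the dimension of the moduli space given by Reineke (2008) for double-framed representations. Here we also assume that there are no bias vertices. Let $Q = (V, E, s, t)$ be a network quiver. Recall that the delooped quiver $Q^\circ = (V^\circ, E^\circ, s^\circ, t^\circ)$ is obtained from $Q$ by removing all loops. The hidden quiver $\widetilde{Q} = (\widetilde{V}, \widetilde{E}, \widetilde{s}, \widetilde{t})$ is obtained from the delooped quiver by removing the input and the output layers. We will consider the group of integer vectors indexed by the hidden vertices of $Q$, that is, $\mathbb{Z}^{\widetilde{V}}$, whose group operation is coordinate-wise addition. The Euler form of the hidden quiver $\widetilde{Q}$ is the map $[-, -]_{\widetilde{Q}} : \mathbb{Z}^{\widetilde{V}} \times \mathbb{Z}^{\widetilde{V}} \to \mathbb{Z}$ given by
\[ [\eta, \eta']_{\widetilde{Q}} = \sum_{v, v' \in \widetilde{V}} \eta_v \, \eta'_{v'} \, [v, v']_{\widetilde{Q}}, \]
where in $[v, v']_{\widetilde{Q}}$ the vectors $v, v'$ are considered as vectors in $\mathbb{Z}^{\widetilde{V}}$ with entry one in position $v$ and $v'$, respectively, and zero otherwise. Here, the Euler form on the vectors $v, v'$ is by definition
\[ [v, v']_{\widetilde{Q}} := \langle v, v' \rangle - \#\{ \epsilon \in \widetilde{E} : s(\epsilon) = v \text{ and } t(\epsilon) = v' \}, \]
where $\langle v, v' \rangle$ is the usual dot product of vectors indexed by $\widetilde{V}$. In other words,
\[ [\eta, \eta']_{\widetilde{Q}} = \sum_{v \in \widetilde{V}} \eta_v \eta'_v - \sum_{\epsilon \in \widetilde{E}} \eta_{s(\epsilon)} \eta'_{t(\epsilon)}. \]
The dimension vector $\eta = (\eta_v)_{v \in \widetilde{V}}$ of a quiver representation $W = ((W_v)_{v \in \widetilde{V}}, (W_\epsilon)_{\epsilon \in \widetilde{E}})$ is an integer vector indexed by $\widetilde{V}$ where $\eta_v = \dim(W_v)$ for every $v \in \widetilde{V}$. We are interested in the particular dimension vector of thin representations $\gamma := (1, \dots, 1)$. It can be shown that the Euler form satisfies $[\gamma, \gamma]_{\widetilde{Q}} = \#\widetilde{V} - \#\widetilde{E}$.
We consider now another dimension vector $m = (m_v)_{v \in \widetilde{V}}$ where $m_v = 0$ if $v$ is not an output vertex of the hidden quiver $\widetilde{Q}$ and $m_v = k$ (the number of output vertices of $Q$) if $v \in \widetilde{V}$ is an output vertex of the hidden quiver $\widetilde{Q}$. A framed quiver representation of $\widetilde{Q}$ by the dimension vector $m = (m_v)_{v \in \widetilde{V}}$ is a thin representation $M$ of $\widetilde{Q}$ together with a family of linear maps $(h_v : M_v \to \mathbb{C}^{m_v})_{v \in \widetilde{V}}$.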
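We now extend $\widetilde{Q}$ to a quiver $\widetilde{Q}^\vee = (\widetilde{V}^\vee, \widetilde{E}^\vee, \widetilde{s}^\vee, \widetilde{t}^\vee)$ where we add a new vertex $\infty$ by $\widetilde{V}^\vee = \widetilde{V} \cup \{\infty\}$. We also add $m_v = k$ arrows (the number of output vertices of $Q$) from $v \in \widetilde{V}$ to the vertex $\infty$ when $v$ is an output vertex of $\widetilde{Q}$, and we add no more arrows to $\widetilde{Q}^\vee$. We extend the dimension vector $\gamma = (1, \dots, 1)$ to a dimension vector of $\widetilde{Q}^\vee$ by defining $\gamma_\infty = 1$. We still denote by $\gamma = (1, \dots, 1)$ the extended dimension vector. The Euler form of $\widetilde{Q}^\vee$ is given by
\[ [\gamma, \gamma]_{\widetilde{Q}^\vee} = [\gamma, \gamma]_{\widetilde{Q}} + 1 - \langle m, \gamma \rangle. \]
Thus, the dimension of the framed quiver moduli space is
\[ 1 - [\gamma, \gamma]_{\widetilde{Q}^\vee} = \langle m, \gamma \rangle - [\gamma, \gamma]_{\widetilde{Q}}. \]
We now extend the quiver $\widetilde{Q}^\vee$ to obtain another quiver ${}^\vee\widetilde{Q}^\vee$. We add a new vertex that we denote $0$ to $\widetilde{V}^\vee$ and we add $n_v = d$ (the number of input vertices of $Q$) arrows from $0$ to $v$ if $v \in \widetilde{V}$ is an input vertex of $\widetilde{Q}$, and we add no more arrows. We also extend the dimension vector $\gamma = (1, \dots, 1)$ to a dimension vector in ${}^\vee\widetilde{Q}^\vee$ by $\gamma_0 = 1$. The double-framed quiver ${}^\vee\widetilde{Q}^\vee$ is equal to the delooped quiver $Q^\circ$, and recall that the $W^f_x$'s are representations of $Q^\circ$. Thus, the dimension of the moduli space of double-framed thin quiver representations in terms of its Euler form is given by
\[ \dim {}_d\mathcal{M}_k(\widetilde{Q}) = 1 - [\gamma, \gamma]_{{}^\vee\widetilde{Q}^\vee} = \langle n, \gamma \rangle + \langle m, \gamma \rangle - [\gamma, \gamma]_{\widetilde{Q}} - 1. \]
In our case, $\langle n, \gamma \rangle$ and $\langle m, \gamma \rangle$ count the number of arrows in the input and output layers of $Q$, respectively. For the Euler form of $\widetilde{Q}$ we have that $[\gamma, \gamma]_{\widetilde{Q}} = \#\widetilde{V} - \#\widetilde{E}$, and finally
\[ \dim {}_d\mathcal{M}_k(\widetilde{Q}) = \langle n, \gamma \rangle + \langle m, \gamma \rangle + \#\widetilde{E} - \#\widetilde{V} - 1 = \#E^\circ - \#\widetilde{V} - 1, \]
as claimed.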
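As an illustration of the final count, the sketch below evaluates the formula for a fully connected multilayer perceptron without bias vertices, which is an assumption made here so that the number of arrows between consecutive layers is the product of their widths. The function name `moduli_dimension` and the list-of-widths encoding are hypothetical.

```python
def moduli_dimension(layer_widths):
    """Evaluate <n, gamma> + <m, gamma> - [gamma, gamma] - 1 for an MLP.

    layer_widths: [d, h_1, ..., h_r, k] with at least one hidden layer;
    layers are assumed fully connected and without bias vertices.
    """
    d, *hidden, k = layer_widths
    if not hidden:
        raise ValueError("at least one hidden layer is required")

    n_gamma = d * hidden[0]    # arrows in the input layer
    m_gamma = k * hidden[-1]   # arrows in the output layer
    hidden_vertices = sum(hidden)
    hidden_edges = sum(a * b for a, b in zip(hidden, hidden[1:]))

    # Euler form of the thin dimension vector on the hidden quiver.
    euler = hidden_vertices - hidden_edges
    dimension = n_gamma + m_gamma - euler - 1

    # Same count written as (#edges of the delooped quiver) - (#hidden vertices) - 1.
    delooped_edges = n_gamma + m_gamma + hidden_edges
    assert dimension == delooped_edges - hidden_vertices - 1
    return dimension

# The 2-3-3-2 perceptron of Appendix A: 21 arrows and 6 hidden vertices.
print(moduli_dimension([2, 3, 3, 2]))  # 14
```

For the perceptron of Appendix A this gives $6 + 6 + 9 - 6 - 1 = 14$, which agrees with $\#E^\circ - \#\widetilde{V} - 1 = 21 - 6 - 1$.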