Pullback Bundles and the Geometry of Learning

Explainable Artificial Intelligence (XAI) and acceptable artificial intelligence are active topics of research in machine learning. For critical applications, being able to prove, or at least to ensure with a high probability, the correctness of algorithms is of utmost importance. In practice, however, few theoretical tools are known that can be used for this purpose. Using the Fisher Information Metric (FIM) on the output space yields interesting indicators in both the input and parameter spaces, but the underlying geometry is not yet fully understood. In this work, an approach based on the pullback bundle, a well-known trick for describing bundle morphisms, is introduced and applied to the encoder–decoder block. Under a constant rank hypothesis on the derivative of the network with respect to its inputs, a description of its behavior is obtained. Further generalization is gained through the introduction of the pullback generalized bundle, which takes into account the sensitivity with respect to the weights.


Introduction
Explainable Artificial Intelligence (XAI) is generally described as a collection of methods allowing humans to understand how an algorithm is able to learn from a database, reproduce and generalize. It is currently an active, multidisciplinary area of research [1,2] that relies on several theoretical or heuristic tools to identify salient features and indicators explaining the surprisingly good performance of machine learning algorithms, especially deep neural networks. From a statistical point of view, a neural network is nothing but a parameterized regression or classification model that can be described as a random variable whose probability distribution is known conditionally on external inputs and internal parameters [3]. Unfortunately, even if this approach seems the most natural one, it is not adapted to XAI, as no insight is gained on the learning and inference process. Furthermore, there seems to be a contradiction between the statistical procedure, which calls for models with the smallest possible number of free parameters, and the performance of deep learning, which relies on thousands to millions of weights. On the other hand, attempts have been made to design numerical [4] or visual [5] indicators aiming at producing a summary of salient features.
XAI is also related to acceptable AI, that is, proving, or at least ensuring with a high probability, that the model will produce the intended result and is robust to perturbations, either inherent to the data acquisition process or intentional. In both cases, it is mandatory to be able to perform a sensitivity analysis on a trained network. In [6], an approach based on geometry was taken and the need for a metric on the set of admissible perturbations was enforced. The problem of so-called adversarial attacks is treated in several papers [7][8][9] where mitigating procedures are proposed. Adversarial attacks are a major concern for acceptable AI, especially in critical applications like autonomous vehicles or air traffic control. Until now, most of the research effort has been dedicated to the design of such attacks, with the idea of incorporating the fooling inputs in the learning database in order to increase robustness. The reader can refer, for example, to Fast Gradient Sign methods [10], robust optimization methods [11] or DeepFool [12,13]. Unfortunately, while these approaches are relevant to acceptable AI, they do not provide XAI with usable tools. Furthermore, they rely on inputs in R^n, or more generally in a finite dimensional Euclidean space, which is not always a valid hypothesis.
There is also the question of why learning from a high-dimensional data space is possible, and a possible answer is that data effectively lie on a low-dimensional manifold [14,15]. As a consequence, most of the directions in the input space will have a very small impact on the output, while only a small number of them, namely those that are tangent to the data manifold, are going to be of great influence [16]. The manifold hypothesis also justifies the introduction of the encoder–decoder architecture [17], which is of wide use in the fields of natural language processing [18] and time-series prediction [19]. The true underlying data manifold, if it exists, is most of the time not accessible, although some of its characteristics may be known and incorporated in the model. In particular, it may be subject to the action of a Lie group or possess extra geometric properties, like the existence of a symplectic structure. Specific networks have been designed to cope with such situations [20,21].
In a general setting, little is known about the data manifold and its geometric features, like its metric, Levi-Civita connection and curvature. However, Riemannian properties are the most important ones, as they dictate the behavior of the network under moves in the input space. Recalling the statistical approach invoked before, it makes sense to model the output of the network as a probability density parameterized by inputs and weights. Within this frame, there exists a well-defined Riemannian metric on the output space known as the Fisher Information Metric (FIM), originating from a second order expansion of the Kullback-Leibler divergence. The importance of this metric has already been pointed out in several past works [22,23]. The FIM can be pulled back to the input space, yielding, in most cases, a degenerate metric that can nevertheless be exploited to better understand the effect of perturbations [16], or to the parameter space to improve gradient-based learning algorithms [24]. In this last case, however, things tend to be less natural than for the input space.
In this work, a unifying framework for studying the geometry of deep networks is introduced, allowing a description of encoder–decoder blocks from the FIM perspective. The pullback bundle is a key ingredient in our approach.
In the sequel, features and outputs are random variables, thus characterized by their distribution functions, or their densities in the absolutely continuous case. Within this frame, a neural network is a random variable N : ω ↦ N(X(ω), W(ω)), where (Ω, T, P) is an underlying probability space and (E, E), (Θ, F) are, respectively, the input and weight measure spaces. Finally, Y is assumed to take its values in the output measure space (O, O). Most of the time, the network has a layered structure, so that the expression of N can be factored out as a composition of layer mappings. In many practical implementations, the weights W are deterministic, which is equivalent to saying that their probability distribution is a Dirac distribution. In this case, a neural network can be described as a parameterized family of random variables N_W : ω ↦ N(X(ω), W). A special case occurs when a single decoder is considered [25], that is, a measurable function given by a smooth mapping f, assumed in [25] to be an immersion; that is, for any x, Df_x has maximal rank d. Conversely, one may consider an encoder and assume f to be a submersion. In this paper, the geometry of the complete encoder–decoder network will be considered, as well as the cases d ≥ m and d ≤ n.
The article is structured as follows: In Section 2, the Fisher information metric is introduced and some formulas, valid when the parameter space is a smooth manifold, are given. In Section 3, the pullback bundle is defined and applied to the encoder–decoder case. A numerical example is presented in Section 4 and a conclusion is drawn in Section 5. The convention of summation on repeated indices applies in this manuscript.

The Fisher Information Metric
In this section, we recall some basic definitions and properties in information geometry. The foundational ideas can be traced back to [26], but the main developments occurred quite recently. The reader is referred to [27] for a comprehensive introduction. The exposition below assumes a quite high degree of regularity for the parameterized density families, which is nevertheless a common situation in practice, especially in the field of machine learning we are interested in.

Definitions and Properties
Definition 1. A statistical model is a pair (M, p) where M is an oriented n-dimensional smooth manifold and (p_θ)_{θ∈M} is a parameterized family of probability densities on a measured space (Ω, T, µ) such that, putting p(θ, ω) = p_θ(ω), the mapping θ ↦ p(θ, ω) is smooth for almost all ω and ∫_Ω p(θ, ω) dµ(ω) = 1 for all θ ∈ M. Assuming p never vanishes, one can define the score l : M × Ω → R as l(θ, ω) = log p(θ, ω). For any θ ∈ M, ∫_Ω p(θ, ω) dµ(ω) = 1. Thus, using the fact that the assumptions made on the family p_θ allow swapping derivatives and integrals, it becomes ∫_Ω ∂_i p(θ, ω) dµ(ω) = 0, where ∂_i denotes the derivative with respect to the i-th component of θ in local coordinates. A simple computation shows that ∂_i l_θ = ∂_i p_θ / p_θ, proving that the score l_θ = log p_θ satisfies E_{p_θ}[∂_i l_θ] = 0. Let g be the section of T*M ⊗ T*M defined by g_{ij}(θ) = E_{p_θ}[∂_i l_θ ∂_j l_θ]. Now, given any tangent vector X = X^i ∂_i ∈ T_θM: g(X, X) = E_{p_θ}[(X(l_θ))²] ≥ 0, with equality if and only if X(l_θ) vanishes almost everywhere. Given the assumptions made on the family p_θ, g is thus a positive definite symmetric section of T*M ⊗ T*M, hence a Riemannian metric on M, called the Fisher Information Metric (FIM).
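The defining formula g_{ij}(θ) = E[∂_i l ∂_j l] lends itself to a direct Monte Carlo check on a family whose FIM is known in closed form. The sketch below (the function name and the choice of the Gaussian location family, for which g = I/σ², are illustrative choices, not taken from the text) estimates the metric as the empirical outer product of scores.

```python
import numpy as np

def fisher_metric_gaussian_mean(theta, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of g_ij = E[d_i l d_j l] for the family
    p_theta = N(theta, sigma^2 I): draw omega ~ p_theta, evaluate the
    score d_i l = (omega_i - theta_i) / sigma^2, and average its outer
    product.  Closed form: g = I / sigma^2."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(loc=theta, scale=sigma, size=(n_samples, len(theta)))
    score = (omega - theta) / sigma**2
    return score.T @ score / n_samples

g_hat = fisher_metric_gaussian_mean(np.array([0.5, -1.0]), sigma=2.0)
```

For σ = 2 the estimate should be close to I/4, up to Monte Carlo noise.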
Remark 1. The mapping I : θ ↦ √p_θ embeds M as a submanifold of the unit sphere in L²(Ω, µ), and the Fisher information metric is just the pullback with respect to I of the ambient metric in L²(Ω, µ). However, in machine learning applications, it is common to consider parameter spaces for which the one-to-one assumption for I is not valid, so that g is only positive semidefinite. The study of the rank of the metric in this case is an important research topic.
It is quite fruitful to consider differential forms on M parameterized by Ω.The starting point is the definition of parameterized degree 0 forms.
Definition 2. A parameterized degree 0 form is a mapping f : M × Ω → R, smooth in its first argument, such that for all θ_0 ∈ M and all integers n, there exists a neighborhood U_{n,θ_0} and an integrable positive mapping h_{n,θ_0} such that for all θ ∈ U_{n,θ_0} and almost all ω ∈ Ω, the derivatives of f up to order n are dominated by h_{n,θ_0}. Proposition 1. Let X be a vector field on M and f a parameterized 0-form in the previous sense. Then: X(E[f]) = E[X(f)] + E[f X(l)], with l(θ, ω) = log p(θ, ω).
Proof. E[f] is a degree 0 form on M. If ψ is the flow of X, then X(E[f])(θ) = d/dt|_{t=0} ∫_Ω f(ψ_t(θ), ω) p(ψ_t(θ), ω) dµ(ω). The assumptions made on f allow the swapping of derivatives and integrals, so X(E[f]) = ∫_Ω (X(f) p + f X(p)) dµ = E[X(f)] + E[f X(l)], since X(p) = p X(l). Remark 2. Applying Proposition 1 to the constant function f = 1 yields E[X(l)] = 0, a result already known by Equation (9). A parameterized degree k differential form on M can be defined readily by requiring that the coefficients of the elementary forms dθ^{i_1} ∧ ⋯ ∧ dθ^{i_k} be parameterized differential forms of degree 0. Proposition 2. Let α be a degree k parameterized differential form on M. Then: d E[α] = E[dα] + E[dl ∧ α]. Since d(p α) = p(dα + dl ∧ α), the claim follows.
Proceeding the same way as in Proposition 1, and using Cartan's homotopy formula L_X = d ι_X + ι_X d, we obtain: Proposition 3. Let X be a vector field on M and α a degree k parameterized differential form. Then: L_X E[α] = E[L_X α] + E[X(l) α].
When α = dl, Equation (21) reads as: L_X E[dl] = E[L_X dl] + E[X(l) dl]. Since E[dl] = 0, it becomes: E[L_X dl] = −E[X(l) dl]. Given two vector fields X, Y: (L_X dl)(Y) = X(Y(l)) − [X, Y](l) and E[X(l) dl(Y)] = E[X(l) Y(l)] = g(X, Y), with g the Fisher metric. Thus: E[X(Y(l))] − E[[X, Y](l)] = −g(X, Y), and, since E[[X, Y](l)] = 0, after taking the expectation: g(X, Y) = −E[X(Y(l))]. This is a well-known result in the R^n case, where it reduces to g_{ij} = −E[∂_i ∂_j l].
Let ∇ be an affine connection on TM. The same computation as above yields: Proposition 5. Let X, Y be vector fields on M and α a degree k parameterized differential form. Then: ∇_X E[α] = E[∇_X α] + E[X(l) α]. Taking α = dl gives E[∇dl(X, Y)] = −g(X, Y), showing that while the parameterized Hessian ∇dl depends on the connection ∇, it is not the case of its expectation.
When Ω = M = R^n, µ = dx^1 dx^2 ⋯ dx^n, the Fisher metric is known to be twice the second order term in the Taylor expansion of the Kullback-Leibler divergence, which can be proved easily by iterating derivatives. More generally, let ∇ be a connection and let θ : ]−ε, ε[ → M, ε > 0, be a smooth curve with θ_0 = θ(0), X = θ'(0). Recall that the Kullback-Leibler divergence between two probability densities p, q is defined as: KL(p, q) = ∫_Ω p log(p/q) dµ. The mapping: ξ : t ↦ KL(p_{θ_0}, p_{θ(t)}) is smooth, so the Taylor formula applies for t close enough to 0: ξ(t) = ξ(0) + t ξ^{(1)}(0) + (t²/2) ξ^{(2)}(0) + o(t²). The first derivative ξ^{(1)}(0) is readily computed as: ξ^{(1)}(0) = −E[X(l)] = 0. The second derivative ξ^{(2)}(0) can be obtained using ∇: when the curve t ↦ θ(t) is a geodesic for ∇, ξ^{(2)}(0) = g(X, X), which characterizes g at θ_0. Higher-order terms can be computed by repeatedly applying Proposition 5 and are expressed thanks to the successive covariant derivatives of dl along the curve. An interesting case occurs when the Fisher metric is non-degenerate and ∇_{lc} is its associated Levi-Civita connection. Normal coordinates at θ_0, denoted by x^i, i = 1 … N, are given by taking an orthonormal basis (v_1, …, v_N) with respect to the Fisher metric and using the exponential map [28] (p. 72). Using the x^i, i = 1 … N system of coordinates in place of θ, and noting that θ_0 corresponds to the origin in normal coordinates, the KL divergence can be approximated at order 2 by: KL(p_{θ_0}, p_x) ≈ (1/2) ∑_{i=1}^N (x^i)², where x = (x^1, …, x^N).
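The second-order role of the Fisher metric in the KL expansion can be illustrated numerically on a one dimensional family. The sketch below uses the exponential family Exp(λ), for which g(λ) = 1/λ² and the KL divergence are available in closed form (this choice of family is illustrative and not from the text), and compares KL(p_{λ0}, p_{λ0+t}) with (1/2) g(λ0) t² for a small displacement t.

```python
import numpy as np

def kl_exponential(lam0, lam):
    """Closed-form KL(Exp(lam0) || Exp(lam))."""
    return np.log(lam0 / lam) + lam / lam0 - 1.0

def fim_exponential(lam):
    """Fisher information of the exponential family Exp(lam): g = 1/lam^2."""
    return 1.0 / lam**2

lam0, t = 2.0, 0.01                        # base point, small displacement
kl = kl_exponential(lam0, lam0 + t)
quad = 0.5 * fim_exponential(lam0) * t**2  # (1/2) g(X, X) t^2 with X = d/d lam
```

The two values agree to within a relative error of order t, as expected from the o(t²) remainder.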

The Fisher Information in Machine Learning
In machine learning applications, when the output is a probability distribution, the Kullback-Leibler divergence is a natural measure of goodness-of-fit. Assuming that the database is given in the form of an iid sample of couples (X_i, Y_i)_{i=1…N}, one can introduce the error function: E(W) = ∑_{i=1}^N KL(p_{Y_i}, p_{N(X_i, W)}). That may be approximated by: Ẽ(W) = (1/2) ∑_{i=1}^N g(→(Y_i N(X_i, W)), →(Y_i N(X_i, W))), where the notation →PQ stands for the tangent vector at P such that a geodesic (for ∇_{lc}) θ with θ(0) = P and θ'(0) = →PQ satisfies θ(1) = Q. Taking the derivative with respect to W yields terms of the form g(→(Y_i N(X_i, W)), ∂N(X_i, W)/∂W), with ∂N(X_i, W)/∂W being a tangent vector at N(X_i, W). We recall the musical isomorphism ♭ : TM → T*M defined by X♭ = g(X, ·) and use it to rewrite (41) accordingly. In this form, having a critical point of the energy Ẽ with respect to W is equivalent to the vanishing of a totally symmetric multilinear form on TM ⊕ T*M, the generalized tangent bundle of M.
Finally, if ψ : N → M is a smooth mapping, one can take the pullback of the Fisher metric on M to obtain a semi-definite symmetric bilinear form on N: (ψ*g)(X, Y) = g(dψ X, dψ Y). When ψ is an embedding, ψ*g is a Fisher metric on N with p_{ψ(η)}, η ∈ N, as underlying densities. This is the case considered in [25].
As an example of a pullback metric, we are going to investigate the case of the von Mises-Fisher distribution (VMF) on S^{n−1} with density: p_{κ,µ}(x) = C_n(κ) exp(κ⟨x, µ⟩), where κ ≥ 0 is the concentration parameter, µ ∈ S^{n−1} is the location parameter and the normalizing constant C_n(κ) is expressed with I_k, the modified Bessel function of the first kind of order k. The Fisher metric in the embedding space R^n can be deduced from the second moment E[xx^t], since l_{κ,µ} = log p_{κ,µ} = f(κ) + κ⟨x, µ⟩. In the sequel, κ is assumed to be constant. Although the expression for E[xx^t] has been given in [29], we present here an alternative proof based on the fact that for any integer n, S^{n−1} is a suspension of S^{n−2}. If x = (x_1, …, x_n), then xx^t is a matrix whose (i, j) entry is x_i x_j. By the rotation invariance of the VMF, µ can be selected as the first vector of an orthonormal basis, with respect to which x is expressed in components as x = (x_1, …, x_n). If we specialize the first component, the remaining ones can be written in suspension coordinates as x_i = sin θ ξ_i, i = 1 … n − 1, with ξ ∈ S^{n−2} and σ_{n−2} the Lebesgue measure on S^{n−2}. For i ≠ 1, j ≠ 1 with i ≠ j, the integral vanishes by symmetry; otherwise, it reduces to an integral over θ involving A_{S^{n−3}}, the area of the (n − 3)-sphere, which is given by the general relation: A_{S^{n−1}} = 2π^{n/2}/Γ(n/2). Now, observing that [30] the angular integrals can be expressed with B, the beta function, the overall expression becomes, after using (49), a ratio of Bessel functions. When i = j = 1, the expression for the second moment becomes an integral that is a difference of two terms, each of which can be simplified as before. This procedure can easily be applied to an arbitrary moment, each of the integrals involved being expressible using I_n and the beta function.
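The second moment computation can be cross-checked by sampling. The sketch below specializes to S² (n = 3), where the component of x along µ has a closed-form inverse CDF and where the standard closed form E[xx^t] = (A/κ) I + (1 − 3A/κ) µµ^t holds, with A = coth κ − 1/κ; the n = 3 specialization and all function names are ours, not taken from the text.

```python
import numpy as np

def sample_vmf_s2(kappa, n_samples, rng):
    """Sample the von Mises-Fisher distribution on S^2 with location
    mu = e_1, via the closed-form inverse CDF of the component along mu."""
    u = rng.uniform(size=n_samples)
    w = np.log(u * np.exp(kappa) + (1 - u) * np.exp(-kappa)) / kappa
    phi = rng.uniform(0.0, 2 * np.pi, size=n_samples)
    r = np.sqrt(np.clip(1 - w**2, 0.0, None))
    return np.column_stack([w, r * np.cos(phi), r * np.sin(phi)])

def second_moment_vmf_s2(kappa):
    """Closed-form E[x x^t] for the VMF on S^2 with mu = e_1:
    (A/kappa) I + (1 - 3 A/kappa) mu mu^t, with A = coth(kappa) - 1/kappa."""
    A = 1.0 / np.tanh(kappa) - 1.0 / kappa
    m = (A / kappa) * np.eye(3)
    m[0, 0] += 1 - 3 * A / kappa
    return m

rng = np.random.default_rng(0)
x = sample_vmf_s2(5.0, 200_000, rng)
emp = x.T @ x / len(x)       # empirical second moment
ref = second_moment_vmf_s2(5.0)
```

Since ‖x‖ = 1, the trace of the second moment is 1, which gives a quick consistency check on both expressions.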
Remark 4. Since µ is not a parameterization of the unit sphere, the Fisher metric defined that way is related to an ambient metric in R n , defined only on the unit sphere.
An obvious embedded dimension n − 2 submanifold of S^{n−1} is obtained by taking a unit vector ν and computing the intersection of S^{n−1} with a hyperplane H defined by: ⟨x, ν⟩ = α, |α| < 1. An elementary computation proves that the intersection locus is an (n − 2)-sphere of radius √(1 − α²) contained in H. Without loss of generality, ν can be taken as (1, 0, …, 0) and the embedding can be written easily as: y ↦ (α, √(1 − α²) y), y ∈ S^{n−2}. The pullback metric is just the original one scaled by 1 − α². The loss functions related to the VMF distribution are discussed in [31].
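The scaling of the pullback metric by 1 − α² can be observed directly on curve lengths: the embedding y ↦ (α, √(1 − α²) y) multiplies every chord, hence every length, by √(1 − α²). A minimal numerical check (the discretization of a quarter circle is our illustrative choice):

```python
import numpy as np

def embed(y, alpha):
    """Embedding of S^{n-2} into S^{n-1} as the slice <x, e_1> = alpha."""
    return np.concatenate([[alpha], np.sqrt(1 - alpha**2) * y])

alpha = 0.6
t = np.linspace(0.0, np.pi / 2, 2001)
curve = np.column_stack([np.cos(t), np.sin(t)])      # quarter circle on S^1
emb = np.array([embed(y, alpha) for y in curve])     # its image in S^2

# Polygonal approximations of the curve lengths before and after embedding.
len_orig = np.sum(np.linalg.norm(np.diff(curve, axis=0), axis=1))
len_emb = np.sum(np.linalg.norm(np.diff(emb, axis=0), axis=1))
```

The embedded length equals √(1 − α²) times the original one, as predicted by the scaled pullback metric.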

Pullback Bundles
In this section, a neural network with weights W is a mapping N(·, W) : I → O, where I (resp. O) is the input (resp. output) manifold of dimension n (resp. m). Both manifolds are assumed to be smooth, as is the mapping N_W. This last assumption is valid when the activation functions are smooth, which is the case for sigmoid functions, but not for the commonly used ReLU function. However, smooth approximations to the ReLU can be constructed with an arbitrary degree of accuracy, so the framework introduced below can still be applied.
As mentioned in the introduction, O is further assumed to be a statistical model 1 with Fisher metric g.This setting is the one of a neural network whose output is a random variable with conditional density in a family p θ , θ ∈ O.
When the weights are kept fixed, the only free parameters are the inputs and the network is fully described by the mapping: x ↦ N(x, W). For ease of notation, the mapping N(·, W) will be abbreviated as N_W(·). When the activation functions in the network are smooth, N_W(·) is a smooth mapping and its derivative will be denoted by dN_W(·). With this convention, the pullback metric of g by N_W(·), denoted g̃, is defined by: g̃(X, Y) = g(dN_W X, dN_W Y). Unless the network N_W is a decoder, g̃ is generally degenerate and does not provide I with a Riemannian structure, so an ambient metric h on I is assumed to exist. The triple (I, h, g̃) is called the data manifold of the network. The kernel of g̃, denoted ker g̃, is the distribution in TI consisting of vectors X such that g̃(X, ·) is the zero mapping. At a point x ∈ I, the vectors in T_xI belonging to ker g̃ give directions in which the output of the network will not change up to order 1. Figure 1 represents the case of a one dimensional output space and a 2-sphere input space. Since the dimension of the output is less than that of the input, some moves in the data manifold will not induce any change at the output.
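In coordinates, the pullback metric is g̃ = J^t g J with J the Jacobian of N_W, and ker g̃ contains the directions invisible to the network at first order. A minimal sketch (the toy network, which only depends on x₁ + x₂, and the finite-difference Jacobian standing in for backpropagation are illustrative assumptions):

```python
import numpy as np

def network(x):
    """Toy smooth network N_W : R^2 -> R^2 depending only on x1 + x2, so
    directions tangent to x1 + x2 = const lie in the kernel of the
    pullback metric."""
    s = x[0] + x[1]
    return np.array([np.sin(s), np.cos(s)])

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian (stand-in for backpropagation)."""
    cols = []
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.column_stack(cols)

def pullback_metric(f, g_out, x):
    """Pullback of the output metric g: g~(u, v) = g(dN u, dN v) = J^t g J."""
    J = jacobian(f, x)
    return J.T @ g_out @ J

x0 = np.array([0.3, 0.7])
G = pullback_metric(network, np.eye(2), x0)
kernel_dir = np.array([1.0, -1.0])   # tangent to the level sets of x1 + x2
```

The vector (1, −1) annihilates G up to discretization error, illustrating a first-order invisible direction.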
Unless the dimension of ker g̃ is constant, this distribution does not define a foliation. However, this is true locally in the neighborhood of points in I at which dN_W(·) has maximal rank. Finally, if (E, π, O) is a rank r vector bundle on O, then its pullback by N_W(·) will be denoted in short by E^{N_W}. We recall that if E has local charts (V_i, ψ_i), i ∈ I, and I has local charts (U_j, φ_j), j ∈ J, then E^{N_W} has local charts obtained by pulling back the V_i along N_W. The pullback bundle enjoys a universal property that is in fact the main reason for introducing it in our context. Proposition 6. Let (F, π', I) (resp. (E, π, O)) be a vector bundle on I (resp. O). For any bundle morphism (η_1, η_0) : F → E, there exists a unique bundle morphism (η̃_1, Id) such that the following diagram commutes, where π_{η_0} : (x, v) ↦ x and η̃_0 : (x, v) ↦ (η_0(x), v). This proposition is a classical one and its proof can be found in many textbooks. The one we give below is very simple, using only local charts.
The above proof is constructive and thus gives a practical means of computation. For a network with fixed weights, e.g., a trained one, the derivative dN_W can be efficiently computed by backpropagation, so the bundle morphism (dN_W, N_W) : TI → TO has a practical meaning.
Introducing the pullback bundle gives a commutative diagram in which the bundle mapping from TI to T^{N_W}O is the association: (x, v) ↦ (x, d_xN_W v). The pullback bundle is thus a means of representing the action of the network on tangent vectors to the data manifold. As an example, the construction of adversarial attacks given in [32,33] can be revisited in this context, extending it to the general setting of networks with manifold inputs.
The general problem of building an adversarial attack is, informally, to find, for an input point in the data manifold, a direction in which a perturbation will have the most important effect on the output, hopefully fooling the network. Following [33], we define: Definition 3. Let h be a Riemannian metric on the input space. An optimal adversarial attack at x ∈ I with budget ε > 0 is a solution to: max_{v ∈ T_xI, h(v,v) ≤ ε²} g(d_xN_W v, d_xN_W v). Using (38), this optimization program can be viewed as a local approximation to the one based on the Kullback-Leibler divergence: Definition 4. A Kullback-Leibler optimal adversarial attack at x ∈ I with budget ε > 0 is a solution to: max_{v ∈ T_xI, h(v,v) ≤ ε²} KL(p_{N_W(x)}, p_{N_W(exp_x v)}). The metric g on TO can be pulled back to T^{N_W}O by letting: g^{N_W}_x(v, w) = g_{N_W(x)}(v, w). Due to the special form of the criterion, the optimal point is on the boundary, so that, finally, the optimal adversarial attack problem may be formulated as: Definition 5. An optimal adversarial attack at x ∈ I with budget ε > 0 is a solution to: max_{v ∈ ε UT_xI} g^{N_W}(d_xN_W v, d_xN_W v), where UTI stands for the unit sphere bundle with respect to the metric h. Please note that, due to bilinearity, the problem can be solved for ε = 1 and the optimal vector then scaled by the original ε. From standard linear algebra, if G_x is the matrix of the pullback metric at x and H_x the one of h, then one can find orthogonal matrices A, B and diagonal matrices Λ, Σ such that: G_x = AΛA^t, H_x = BΣB^t. Any vector v in UT_xI can be written as: v = BΣ^{−1/2}w, with w a Euclidean unit vector. So that, finally, the original problem can be rewritten as: max_{‖w‖=1} w^t M w, with M = Σ^{−1/2}B^t G_x B Σ^{−1/2}, which is solved readily by taking w to be the unit eigenvector of M associated with the largest eigenvalue. This is the solution found in [33] when H_x = Id. In many cases, as the above example indicates, it is more convenient to work uniquely in the input space, thus justifying the introduction of the pullback bundle T^{N_W}O. From now on, we are going to adopt this point of view.
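The eigenvector recipe above translates directly into a few lines of linear algebra. The sketch below (function name and the matrices G and H are illustrative choices) uses the equivalent whitening M = H^{−1/2} G H^{−1/2}, which yields the same optimal vector as the factorization through B and Σ.

```python
import numpy as np

def optimal_attack_direction(G, H, eps):
    """Maximize v^t G v subject to v^t H v = eps^2, with G the matrix of
    the pullback Fisher metric and H the ambient input metric.  Whitening
    by H^{-1/2} reduces the problem to the top eigenvector of
    M = H^{-1/2} G H^{-1/2}."""
    vals_H, vecs_H = np.linalg.eigh(H)
    H_inv_sqrt = vecs_H @ np.diag(vals_H ** -0.5) @ vecs_H.T
    M = H_inv_sqrt @ G @ H_inv_sqrt
    vals, vecs = np.linalg.eigh(M)
    w = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return eps * H_inv_sqrt @ w

G = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative pullback metric
H = np.array([[2.0, 0.0], [0.0, 1.0]])   # illustrative ambient metric
v = optimal_attack_direction(G, H, eps=0.1)
```

The returned vector saturates the budget constraint and dominates every other direction of the same h-norm.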
Remark 5. Please note that a section of T^{N_W}O is generally not related to a section of the form (63) in either TO or TI, due to the fact that d_xN_W may not be a monomorphism or an epimorphism. The next proposition gives conditions for the existence of global sections in TO associated with global sections in T^{N_W}O. Proposition 7. In the case of a decoding network, when N_W is an embedding, there is a natural embedding of bundles i : TI → T^{N_W}O such that the image of (x, v) is (x, dN_W v). The pullback bundle then splits as: T^{N_W}O = i(TI) ⊕ F, where F has rank m − n.
Be careful that, in this case, a section of the pullback bundle will not define a global section in TO, since some points of the output space may have no preimage by N_W. However, by the extension lemma [34] (Lemma 5.34, p. 115), local (global if N_W(I) is closed) smooth vector fields on TO exist, extending it.
Proof. If N_W is an embedding, N_W(I) is a submanifold of O and, in an adapted chart, a vector field in TN_W(I) can be written as v = ∑_{i=1}^n v^i ∂_i, where the ∂_i, i = 1 … n, are the first n coordinate vector fields. It thus pulls back to a section ṽ of the same form in T^{N_W}O. Now, since dN_W is injective, ṽ is the image of a unique section in TI, hence the claim. Proposition 8. If ker dN_W has constant rank r, then there exist splittings TI = ker dN_W ⊕ F, T^{N_W}O = im dN_W ⊕ G and a bundle isomorphism F → im dN_W that coincides with dN_W on the fibers.
Proof. By Theorem 10.34, [34] (p. 266), ker dN_W is a subbundle of TI and im dN_W a subbundle of T^{N_W}O. In local charts, the morphism dN_W gives rise to a decomposition of the fibers as R^r ⊕ R^{n−r}, with dN_W vanishing on R^r and restricting to an isomorphism on R^{n−r}. Passing to local sections yields the result.
An important case is that of submersions, corresponding to encoders in machine learning. In this case, r = n − m and dN_W establishes a bundle isomorphism between F and T^{N_W}O. The pullback of the Fisher-Rao metric g on TO gives rise to a metric g^{N_W} on T^{N_W}O, but only to a degenerate metric on TI that can, nevertheless, be quite well understood, as indicated below. Definition 6. On the input bundle TI, the symmetric tensor g̃ is defined, using the splitting TI = ker dN_W ⊕ F, by: g̃(X, Y) = g^{N_W}(dN_W X, dN_W Y). Proposition 9. There exists a symmetric (1, 1)-tensor on I, denoted by Θ, such that, for any tangent vectors (X, Y) ∈ TI: g̃(X, Y) = h(ΘX, Y). Proof. From standard linear algebra, there exists an adjoint ᵗdN_W of dN_W, defined by: h(ᵗdN_W X, Y) = g(X, dN_W Y), with, in local coordinates: ᵗN = h^{−1} N^t g, where N (resp. ᵗN) is the matrix associated with dN_W (resp. ᵗdN_W) and, as usual, h^{il} denotes the components of h^{−1}. The (1, 1)-tensor Θ is then the product ᵗdN_W dN_W. Remark 6. Θ is defined even if dN_W is not full rank. Remark 7. All the relevant information concerning dN_W is encoded in Θ. As a consequence, the geometry of an encoder is described by this tensor, hence also that of an encoder-decoder block.
Remark 8. In a local h-orthonormal frame, the tensor Θ has expression g_{pj} N^j_i N^p_k, hence is symmetric.
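Proposition 9 makes Θ directly computable: in coordinates, Θ = h^{−1} N^t g N with N the matrix of dN_W. A minimal sketch (random Jacobian and Euclidean h, g are illustrative choices) also illustrates the link with the singular value decomposition: with Euclidean metrics, the spectrum of Θ consists of the squared singular values of N, padded with zeros.

```python
import numpy as np

def theta_tensor(J, g, h):
    """Theta = t(dN) dN with the adjoint taken w.r.t. the input metric h
    and the output metric g: in coordinates, Theta = h^{-1} J^t g J."""
    return np.linalg.solve(h, J.T @ g @ J)

rng = np.random.default_rng(1)
J = rng.normal(size=(3, 4))          # Jacobian of a 4-input, 3-output encoder
Theta = theta_tensor(J, np.eye(3), np.eye(4))

# With Euclidean metrics, the eigenvectors of Theta are the right singular
# vectors of J and its nonzero eigenvalues the squared singular values.
_, s, _ = np.linalg.svd(J)
```

This is the computation behind the SVD-based evaluation of Θ used in the numerical example below.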

Definition 7. Let ∇ be a connection on TI. Its dual connection ∇* is defined by the next equation: Z h(X, Y) = h(∇_Z X, Y) + h(X, ∇*_Z Y), where Z is any tangent vector in TI and X, Y are vector fields.
Proof. For any vector fields X, Y, and any tangent vector Z, the right-hand side of the defining equation is tensorial in Z and satisfies the Leibniz rule in Y, hence the claim.
Θ, being symmetric, admits a diagonal expression in a local orthonormal frame (X_1, …, X_n). When there exists a connection ∇ such that ∇_Z ΘX = Θ∇_Z X for any vector fields X, Z, parallel transport of the X_i, i = 1 … n, shows that the eigenvalues are constant and the eigenspaces preserved. The existence of a solution to this gauge equation thus greatly simplifies the study of an encoder, as a local splitting of the input manifold exists. The reader is referred to [35] for more details. In fact, the tensor Θ is defined even for general networks and the splitting may exist in this setting. This is the case when the rank of dN_W is locally constant, hence when it is maximal. A practical computation of Θ can be obtained through the singular value decomposition, as Proposition 9 indicates. A numerical integration of the distribution given by the first singular vectors gives rise to a local system of coordinates, defining in turn a connection satisfying the gauge equation (the existence of a global solution has a cohomological obstruction that is outside the scope of this paper).
Finally, we introduce below a construction that takes into account the weight influence. As mentioned in Section 2, the derivative of the network with respect to its weights is adequately described as a 1-form, thus a section of T*O. In fact, when the inner layers of the network are manifolds, the parameters are no longer real values and a suitable extension has to be introduced. One possible approach is to take a connection ∇ on the layer manifold L. Considering a point p ∈ L, the exponential exp^∇ defines a local chart centered at p. Given a point q in the injectivity domain of exp^∇, one can obtain its coordinates as log^∇_p q = →pq and the activation of a neuron with input q as α(→pq), with α a 1-form in T*L. In this general setting, a manifold neuron will be defined by its input in an exponential chart, a 1-form corresponding to the weights in the Euclidean setting, and an activation function. Its free parameters are thus a couple (q, α) ∈ TL ⊕ T*L. This particular vector bundle is known as the generalized tangent bundle.
Recalling (43), it is worth studying the pullback of the generalized bundle TO ⊕ T*O. The generalized pullback bundle is then T^{N_W}O ⊕ (T*O)^{N_W}, whose local sections are generated by the pullbacks of local sections of TO and T*O. Please note that the pullback can be performed on any layer, internal or input. Most of the previous derivations can be carried out on the generalized bundle, which must thus be considered as a general, yet tractable, framework for XAI.

A Numerical Example
In this example, the input data are the handwritten digits from the MNIST database. A neural network was coded in PyTorch and trained on the dataset. The input metric is Euclidean; the output one is the Fisher metric of the multinomial distribution with ten classes, which, in the parameterization by the first nine class probabilities, is given by the matrix: g_{ij} = δ_{ij}/p_i + 1/p_{10}, i, j = 1 … 9. Since the output space has dimension 9, the pullback bundle also has dimension 9. At an input point x, a point in the pullback bundle is a couple (x, v) with v a vector from R^9 at output point N_W(x). On the other hand, the image of the input tangent bundle (simply a vector space in our case) has points (x, dN_W u) with u an input vector. We are thus considering a bundle mapping (x, u) ↦ (x, dN_W u), where the right-hand term has values in the pullback bundle, equipped with the output Fisher metric. The tensor Θ is computed via the singular value decomposition, already implemented in PyTorch. We selected the rotation rate of the singular vector associated with the largest singular value as an indicator of the complexity of the decision process in the neighborhood of an input point.
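The output Fisher metric in this example is the multinomial one, g_{ij} = δ_{ij}/p_i + 1/p_{10} in the parameterization by the first nine probabilities. The sketch below (NumPy in place of PyTorch, and the sample size, are our choices) builds this 9 × 9 matrix at the uniform distribution and cross-checks it against a Monte Carlo score outer product.

```python
import numpy as np

def multinomial_fim(p):
    """Fisher metric of the categorical distribution with k classes,
    parameterized by the first k-1 probabilities:
    g_ij = delta_ij / p_i + 1 / p_k."""
    q = p[:-1]
    return np.diag(1.0 / q) + 1.0 / p[-1]

p = np.full(10, 0.1)
g = multinomial_fim(p)           # the 9 x 9 output metric of the example

# Monte Carlo cross-check via the score outer product E[d_i l d_j l]:
# for one-hot y, d_i l = y_i / p_i - y_10 / p_10.
rng = np.random.default_rng(0)
y = rng.multinomial(1, p, size=200_000)
score = y[:, :-1] / p[:-1] - y[:, -1:] / p[-1]
g_mc = score.T @ score / len(y)
```

At the uniform distribution, the diagonal entries equal 20 and the off-diagonal ones 10, which the empirical estimate reproduces.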
The code was adapted from https://github.com/eliot-tron/CurvNetAttack (accessed on 12 September 2023). A detection of outliers from a sample of 1000 points was performed.
A visual analysis reveals that they correspond to poorly drawn digits, as indicated in Figure 2, where the two digits with the highest curvature indicator are plotted: the first one is labeled "9", which is quite obvious for a human operator, although the final stroke is vertical, while the second is labeled "7", easily confused with a "1".

Conclusions and Future Work
In this paper, several important constructions originating from information geometry were surveyed and some new ones introduced. The pullback bundle on a layer makes it possible to describe the behavior of a network with respect to the Fisher information metric, and a simple description can be obtained when a gauge equation is satisfied. One important feature of this construction is its ability to fit in a general framework where layers take their inputs on a manifold.
Future work involves a companion paper describing computational procedures and examples from real case studies. A study of the properties of the pullback generalized bundle is also in progress. Finally, the case of networks with non-constant rank dN_W must be considered. It is believed that they give rise to singular foliations.

Figure 2. Samples with the highest rotation rate.