Article

Pullback Bundles and the Geometry of Learning

by
Stéphane Puechmorel
ENAC (École Nationale de l’Aviation Civile), Université de Toulouse, 7, Avenue Edouard Belin, 31055 Toulouse, France
Entropy 2023, 25(10), 1450; https://doi.org/10.3390/e25101450
Submission received: 24 August 2023 / Revised: 12 October 2023 / Accepted: 13 October 2023 / Published: 15 October 2023
(This article belongs to the Special Issue Information Geometry for Data Analysis)

Abstract

Explainable Artificial Intelligence (XAI) and acceptable artificial intelligence are active topics of research in machine learning. For critical applications, being able to prove, or at least to ensure with high probability, the correctness of algorithms is of utmost importance. In practice, however, few theoretical tools are known that can be used for this purpose. Using the Fisher Information Metric (FIM) on the output space yields interesting indicators in both the input and parameter spaces, but the underlying geometry is not yet fully understood. In this work, an approach based on the pullback bundle, a well-known trick for describing bundle morphisms, is introduced and applied to the encoder–decoder block. Under a constant rank hypothesis on the derivative of the network with respect to its inputs, a description of its behavior is obtained. Further generalization is gained through the introduction of the pullback generalized bundle, which takes into account the sensitivity with respect to the weights.

1. Introduction

Explainable Artificial Intelligence (XAI) is generally described as a collection of methods allowing humans to understand how an algorithm is able to learn from a database, reproduce and generalize. It is currently an active, multidisciplinary area of research [1,2] that relies on several theoretical or heuristic tools to identify salient features and indicators explaining the surprisingly good performance of machine learning algorithms, especially deep neural networks. From a statistical point of view, a neural network is nothing but a parameterized regression or classification model, which can be described as a random variable whose probability distribution is known conditionally on external inputs and internal parameters [3]. Unfortunately, even if this approach seems the most natural one, it is not adapted to XAI, as no insight is gained into the learning and inference process. Furthermore, there seems to be a contradiction between the statistical procedure, which calls for models with the smallest possible number of free parameters, and the performance of deep learning, which relies on thousands to millions of weights. On the other hand, attempts have been made to design numerical [4] or visual [5] indicators aiming at producing a summary of salient features.
XAI is also related to acceptable AI, that is, proving, or at least ensuring with high probability, that the model will produce the intended result and is robust to perturbations, whether inherent to the data acquisition process or intentional. In both cases, it is mandatory to be able to perform a sensitivity analysis on a trained network. In [6], an approach based on geometry was taken and the need for a metric on the set of admissible perturbations was emphasized. The problem of the so-called adversarial attacks is treated in several papers [7,8,9] where mitigating procedures are proposed. Adversarial attacks are a major concern for acceptable AI, especially in critical applications like autonomous vehicles or air traffic control. To date, most of the research effort has been dedicated to the design of such attacks, with the idea of incorporating the fooling inputs into the learning database in order to increase robustness. The reader can refer, for example, to Fast Gradient Sign methods [10], robust optimization methods [11] or DeepFool [12,13]. Unfortunately, while these approaches are relevant to acceptable AI, they do not provide XAI with usable tools. Furthermore, they rely on inputs in $\mathbb{R}^n$, or more generally in a finite dimensional Euclidean space, which is not always a valid hypothesis.
There is also the question of why learning from a high dimensional data space is possible, and a possible answer is that the data effectively lie on a low dimensional manifold [14,15]. As a consequence, most of the directions in the input space will have a very small impact on the output, while only a small number of them, namely those that are tangent to the data manifold, are going to be of great influence [16]. The manifold hypothesis also justifies the introduction of the encoder–decoder architecture [17], which is of wide use in the field of natural language processing [18] or time-series prediction [19]. The true underlying data manifold, if it exists, is most of the time not accessible, although some of its characteristics may be known and incorporated into the model. In particular, it may be subject to the action of a Lie group or possess extra geometric properties, like the existence of a symplectic structure. Specific networks have been designed to cope with such situations [20,21].
In a general setting, little is known about the data manifold and its geometric features, like the metric, the Levi-Civita connection and the curvature. However, the Riemannian properties are the most important ones, as they dictate the behavior of the network under moves in the input space. Recalling the statistical approach invoked before, it makes sense to model the output of the network as a probability density parameterized by inputs and weights. Within this frame, there exists a well-defined Riemannian metric on the output space known as the Fisher Information Metric (FIM), originating from a second order expansion of the Kullback–Leibler divergence. The importance of this metric has already been pointed out in several past works [22,23]. The FIM can be pulled back to the input space, yielding, in most cases, a degenerate metric that can nevertheless be exploited to better understand the effect of perturbations [16], or to the parameter space to improve gradient-based learning algorithms [24]. In this last case, however, things tend to be less natural than for the input space.
In this work, a unifying framework for studying the geometry of deep networks is introduced, allowing a description of encoder–decoder blocks from the FIM perspective. The pullback bundle is a key ingredient in our approach.
In the sequel, features and outputs are random variables, thus characterized by their distribution functions, or their densities in the absolutely continuous case. Within this frame, a neural network is a random variable:
$Y = N(X, W), \quad X : (\Omega, \mathcal{T}, \mathbb{P}) \to (E, \mathcal{E}), \quad W : (\Omega, \mathcal{T}, \mathbb{P}) \to (\Theta, \mathcal{F})$
where $(\Omega, \mathcal{T}, \mathbb{P})$ is an underlying probability space and $(E, \mathcal{E})$, $(\Theta, \mathcal{F})$ are, respectively, the input and weight measure spaces. Finally, Y is assumed to take its values in the output measure space $(O, \mathcal{O})$. Most of the time, the network has a layered structure, so that the expression of N can be factored out as:
$Y = N\big(N(X, W_2), W_1\big)$
In many practical implementations, the weights W are deterministic, which is equivalent to saying that their probability distribution is a Dirac distribution. In this case, a neural network can be described as a parameterized family of random variables $N_W : \omega \mapsto N(X(\omega), W)$. A special case occurs when a single decoder is considered [25], that is, a measurable function:
$f = N(\cdot, W) : \mathbb{R}^d \to \mathbb{R}^m, \quad d \le m$
where f is a smooth mapping, assumed in [25] to be an immersion; that is, for any x, $Df_x$ has maximal rank d. Conversely, one may consider an encoder
$g = N(\cdot, \tilde{W}) : \mathbb{R}^n \to \mathbb{R}^d, \quad d \le n$
and assume g to be a submersion. In this paper, the geometry of the complete encoder–decoder network
$f \circ g = N\big(N(\cdot, \tilde{W}), W\big)$
will be considered, as well as the case $d \le m$, $d \le n$.
The article is structured as follows: In Section 2, the Fisher information metric is introduced and some formulas, valid when the parameter space is a smooth manifold, are given. In Section 3, the pullback bundle is defined and applied to the encoder–decoder case. A numerical example is presented in Section 4 and, finally, a conclusion is drawn in Section 5. The convention of summation on repeated indices applies throughout this manuscript.

2. The Fisher Information Metric

In this section, we recall some basic definitions and properties in information geometry. The foundational ideas can be traced back to [26], but the main developments occurred quite recently. The reader is referred to [27] for a comprehensive introduction. The exposition below assumes quite a high degree of regularity for the parameterized density families, which is nevertheless a common situation in practice, especially in the field of machine learning we are interested in.

2.1. Definitions and Properties

Definition 1.
A statistical model is a pair $(M, p)$ where $M$ is an oriented $n$-dimensional smooth manifold and $(p_\theta)_{\theta \in M}$ is a parameterized family of probability densities on a measure space $(\Omega, \mathcal{T}, \mu)$ such that, putting $p(\theta, \omega) = p_\theta(\omega)$:
  • For μ-almost all $\omega \in \Omega$, the mapping $\theta \mapsto p(\theta, \omega)$ is smooth;
  • For any $\theta \in M$, there exists an open neighborhood $U_\theta$ of θ and an integrable mapping $h : \Omega \to \mathbb{R}_+$ such that, for any $\xi \in U_\theta$, $|\nabla_\theta p(\xi, \omega)| \le h(\omega)$;
  • The mapping $\theta \mapsto p_\theta \in L^1(\Omega, \mu)$ is one-to-one;
  • The support of $p_\theta$ does not depend on θ.
Assuming p never vanishes, one can define the score $l : M \times \Omega \to \mathbb{R}$ as:
$l(\theta, \omega) = \log p(\theta, \omega)$
For any $\theta \in M$:
$\int_\Omega p(\theta, \omega)\, d\mu(\omega) = 1$
Thus, using the fact that the assumptions made on the family $p_\theta$ allow swapping derivatives and integrals, it becomes:
$\int_\Omega \partial_i p(\theta, \omega)\, d\mu(\omega) = 0, \quad i = 1 \dots n$
where $\partial_i$ denotes the derivative with respect to the $i$-th component of θ in local coordinates. So, the score $l_\theta = \log p_\theta$ satisfies, by (8):
$\mathbb{E}_{p_\theta}\left[\partial_i l_\theta\right] = 0, \quad i = 1 \dots n.$
A simple computation shows that:
$\mathbb{E}\left[\partial_i l_\theta\, \partial_j l_\theta\right] = \int_\Omega \frac{\partial_i p_\theta}{p_\theta}\, \frac{\partial_j p_\theta}{p_\theta}\, p_\theta\, d\mu(\omega) = 4 \int_\Omega \partial_i \sqrt{p_\theta}\; \partial_j \sqrt{p_\theta}\, d\mu(\omega), \quad i, j = 1 \dots n$
proving that:
$g_{ij} = \mathbb{E}\left[\partial_i l_\theta\, \partial_j l_\theta\right] = 4\, \big\langle \partial_i \sqrt{p_\theta},\ \partial_j \sqrt{p_\theta}\big\rangle_{L^2(\Omega, \mu)}$
Let g be the section of $T^*M \otimes T^*M$ defined by:
$g = g_{ij}\, d\theta^i \otimes d\theta^j$
Now, given any tangent vector $X = X^i \partial_i \in T_\theta M$:
$g(\theta; X, X) = g_{ij}\, X^i X^j = 4\, X^i X^j \big\langle \partial_i \sqrt{p_\theta}, \partial_j \sqrt{p_\theta}\big\rangle_{L^2(\Omega, \mu)} = 4\, \big\langle X^i \partial_i \sqrt{p_\theta},\ X^j \partial_j \sqrt{p_\theta}\big\rangle_{L^2(\Omega, \mu)} = 4\, \langle Z, Z\rangle_{L^2(\Omega, \mu)}$
with $Z = X^i \partial_i \sqrt{p_\theta}$. Given the assumptions made on the family $p_\theta$, g is thus a positive definite symmetric section of $T^*M \otimes T^*M$, hence a Riemannian metric on M, called the Fisher Information Metric (FIM).
Remark 1.
The mapping $I : \theta \mapsto \sqrt{p_\theta}$ embeds M as a submanifold of the unit sphere in $L^2(\Omega, \mu)$ and the Fisher information metric is just the pullback of the ambient metric of $L^2(\Omega, \mu)$ with respect to I (up to the constant factor 4). However, in machine learning applications, it is common to consider parameter spaces for which the one-to-one assumption on I is not valid, so that g is only positive semidefinite. The study of the rank of the metric in this case is an important research topic.
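As a concrete illustration of the definition above, the following sketch evaluates $g_{ij} = \mathbb{E}[\partial_i l_\theta\, \partial_j l_\theta]$ numerically for the univariate Gaussian family with parameters $(\mu, \sigma)$ and compares it to the known closed form $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$; the family, the quadrature bounds and the helper names are illustrative choices, not taken from the paper.

```python
# Numerical check of g_ij = E[d_i l d_j l] for the Gaussian family theta = (mu, sigma)
# (illustrative family; the known Fisher matrix is diag(1/sigma^2, 2/sigma^2)).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def score(theta, x, eps=1e-6):
    """Numerical gradient of l(theta, x) = log p_theta(x) with respect to theta."""
    mu, sigma = theta
    l = lambda m, s: norm.logpdf(x, loc=m, scale=s)
    return np.array([
        (l(mu + eps, sigma) - l(mu - eps, sigma)) / (2 * eps),
        (l(mu, sigma + eps) - l(mu, sigma - eps)) / (2 * eps),
    ])

def fisher_matrix(theta):
    """g_ij = integral of d_i l * d_j l * p_theta, entry by entry, by quadrature."""
    mu, sigma = theta
    g = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            integrand = lambda x, i=i, j=j: score(theta, x)[i] * score(theta, x)[j] * norm.pdf(x, mu, sigma)
            g[i, j], _ = quad(integrand, mu - 10 * sigma, mu + 10 * sigma)
    return g

theta = (0.3, 1.5)
print(fisher_matrix(theta))                    # approx [[0.444, 0], [0, 0.889]]
print(np.diag([1 / 1.5**2, 2 / 1.5**2]))       # closed form diag(1/sigma^2, 2/sigma^2)
```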
It is quite fruitful to consider differential forms on M parameterized by Ω . The starting point is the definition of parameterized degree 0 forms.
Definition 2.
A parameterized 0-form is a mapping $f : M \times \Omega \to \mathbb{R}$ satisfying:
  • For almost all $\omega \in \Omega$, the mapping $\theta \in M \mapsto f(\theta, \omega)$ is smooth;
  • For all $\theta_0 \in M$ and all integers n, there exists a neighborhood $U_{n, \theta_0}$ and an integrable positive mapping $h_{n, \theta_0}$ such that, for all $\theta \in U_{n, \theta_0}$ and almost all $\omega \in \Omega$: $|\nabla^n_\theta f(\theta, \omega)| \le h_{n, \theta_0}(\omega)$.
Proposition 1.
Let X be a vector field on M and f a parameterized 0-form in the previous sense. Then:
$X\, \mathbb{E}[f] = \mathbb{E}[X f] + \mathbb{E}[f\, X(l)]$
with $l(\theta, \omega) = \log p(\theta, \omega)$.
Proof. 
$\mathbb{E}[f]$ is a degree 0 form on M. If ψ is the flow of X, then:
$\psi^*\, \mathbb{E}[f] = \int_\Omega f\big(\psi(t, \theta), \omega\big)\, p\big(\psi(t, \theta), \omega\big)\, d\mu(\omega)$
The assumptions made on f allow the swapping of derivatives and integrals, so:
$\left.\frac{\partial}{\partial t}\right|_{t=0} \mathbb{E}[f] = \int_\Omega \nabla_\theta f(\theta, \omega) \cdot X(\theta)\, p(\theta, \omega)\, d\mu(\omega) + \int_\Omega f(\theta, \omega)\, \frac{\nabla_\theta p(\theta, \omega) \cdot X(\theta)}{p(\theta, \omega)}\, p(\theta, \omega)\, d\mu(\omega)$
which is the claimed formula. □
Remark 2.
Applying Proposition 1 to the constant function $f = 1$ yields $\mathbb{E}[X(l)] = 0$, a result already known from Equation (9).
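Proposition 1 can be checked numerically on a one-dimensional toy model; the family $p_\theta = \mathcal{N}(\theta, 1)$, the 0-form $f(\theta, \omega) = \theta\, \omega^2$ and all helper names below are illustrative assumptions, not taken from the paper.

```python
# One-dimensional numerical check of Proposition 1: X E[f] = E[X f] + E[f X(l)],
# with M = R, Omega = R, X = d/dtheta, p_theta = N(theta, 1), f(theta, w) = theta * w^2.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p  = lambda th, w: norm.pdf(w, loc=th, scale=1.0)
f  = lambda th, w: th * w**2
df = lambda th, w: w**2                      # df/dtheta
dl = lambda th, w: (w - th)                  # d log p_theta / dtheta for N(theta, 1)

def E(h, th):                                # expectation under p_theta
    val, _ = quad(lambda w: h(th, w) * p(th, w), th - 10, th + 10)
    return val

th, eps = 0.7, 1e-5
lhs = (E(f, th + eps) - E(f, th - eps)) / (2 * eps)        # X E[f]
rhs = E(df, th) + E(lambda t, w: f(t, w) * dl(t, w), th)   # E[X f] + E[f X(l)]
print(lhs, rhs)                                            # both approx 3*theta^2 + 1
```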
A parameterized degree k differential form on M can be defined readily by requiring that the coefficients of the elementary forms $d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k}$ be parameterized differential forms of degree 0.
Proposition 2.
Let α be a degree k parameterized differential form on M. Then:
$d\, \mathbb{E}[\alpha] = \mathbb{E}[d\alpha] + \mathbb{E}[dl \wedge \alpha]$
Proof. 
It is enough to consider a form $\alpha(\theta, \omega) = f(\theta, \omega)\, d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k}$. Then:
$d\, \mathbb{E}[\alpha](\theta) = \sum_{j=1}^n \mathbb{E}\left[\partial_{\theta^j} f\right] d\theta^j \wedge d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k} + \sum_{j=1}^n \mathbb{E}\left[f\, \partial_{\theta^j} l\right] d\theta^j \wedge d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k}$
Since:
$d\alpha = \sum_{j=1}^n \partial_{\theta^j} f\, d\theta^j \wedge d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k}$
$dl \wedge \alpha = \sum_{j=1}^n f\, \partial_{\theta^j} l\, d\theta^j \wedge d\theta^{i_1} \wedge \cdots \wedge d\theta^{i_k}$
the claim follows. □
Proceeding the same way as in Proposition 1, and using Cartan’s homotopy formula, we obtain:
Proposition 3.
Let X be a vector field on M and α a degree k parameterized differential form. Then
$\mathcal{L}_X\, \mathbb{E}[\alpha] = \mathbb{E}\left[i_X\, d\alpha\right] + \mathbb{E}\left[d\, i_X\, \alpha\right] + \mathbb{E}\left[(i_X\, dl)\, \alpha\right]$
When $\alpha = dl$, Equation (21) reads as:
$\mathcal{L}_X\, \mathbb{E}[dl] = \mathbb{E}\left[i_X\, d^2 l\right] + \mathbb{E}\left[d\, i_X\, dl\right] + \mathbb{E}\left[(i_X\, dl)\, dl\right]$
Since $\mathbb{E}[dl] = 0$ and $d^2 l = 0$, it becomes:
$\mathbb{E}\left[d\, i_X\, dl\right] = -\mathbb{E}\left[(i_X\, dl)\, dl\right]$
Given two vector fields X, Y:
$i_Y\, \mathbb{E}\left[(i_X\, dl)\, dl\right] = \mathbb{E}\left[(i_X\, dl)(i_Y\, dl)\right] = g(X, Y)$
with g the Fisher metric. Thus:
Proposition 4.
$g(X, Y) = -\mathbb{E}\left[i_Y\, d\, i_X\, dl\right]$
Remark 3.
In coordinates, $i_Y\, d\, i_X\, dl = \partial_i \partial_j l\, X^j Y^i + \partial_j l\, \partial_i X^j\, Y^i$, and after taking the expectation:
$g(X, Y) = -\mathbb{E}\left[\partial_i \partial_j l\right] X^j Y^i$
This is a well-known result in the $\mathbb{R}^n$ case.
Let ∇ be an affine connection on T M . The same computation as above yields:
Proposition 5.
Let X be a vector field on M and α a degree k parameterized differential form. Then:
$\nabla_X\, \mathbb{E}[\alpha] = \mathbb{E}\left[\nabla_X\, \alpha\right] + \mathbb{E}\left[(i_X\, dl)\, \alpha\right]$
When $\alpha = dl$, we recover $\mathbb{E}\left[\nabla_X\, dl(Y)\right] = -g(X, Y)$, showing that while the parameterized Hessian $\nabla dl$ depends on the connection ∇, this is not the case for its expectation. When $\Omega = M = \mathbb{R}^n$, $\mu = dx^1\, dx^2 \cdots dx^n$, the Fisher metric is known to be twice the second order term in the Taylor expansion of the Kullback–Leibler divergence, which can be proved easily by iterating derivatives. More generally, let ∇ be a connection and let $\theta : \,]-\epsilon, \epsilon[\, \to M$, $\epsilon > 0$, be a smooth curve with $\theta_0 = \theta(0)$, $X = \theta'(0)$. Recall that the Kullback–Leibler divergence between two probability densities p, q is defined as:
$\mathrm{KL}(p, q) = \mathbb{E}_p\left[\log(p/q)\right] = \int \log\frac{p(x)}{q(x)}\, p(x)\, dx$
The mapping:
$t \in\, ]-\epsilon, \epsilon[\ \mapsto\ \xi(t) = \mathrm{KL}\left(p_{\theta_0}, p_{\theta(t)}\right) = \mathbb{E}_{p_{\theta_0}}\left[l_{\theta_0} - l_{\theta(t)}\right]$
is smooth, so Taylor's formula applies for t close enough to 0:
$\xi(t) = \sum_{i=1}^n \frac{\xi^{(i)}(0)}{i!}\, t^i + o(t^n)$
with:
$\xi^{(i)}(0) = -\mathbb{E}_{p_{\theta_0}}\big[\underbrace{X \cdots X}_{i\ \text{times}}(l)\big] = -\mathbb{E}_{p_{\theta_0}}\big[\underbrace{X \cdots X}_{i-1\ \text{times}}\, dl(X)\big]$
If the curve $t \mapsto \theta(t)$ is a geodesic for ∇, then:
$\nabla_X\, dl(X) = X\, dl(X) - dl\left(\nabla_X X\right) = X\, dl(X)$
and, by recurrence:
$\xi^{(i)}(0) = -\mathbb{E}_{p_{\theta_0}}\left[\nabla_X^{(i-1)}\, dl(X)\right].$
The first derivative $\xi^{(1)}(0)$ is readily computed as:
$-\mathbb{E}\left[dl_{\theta_0}(X)\right] = 0.$
The second derivative $\xi^{(2)}(0)$ can be obtained using ∇ as:
$-\mathbb{E}\left[\nabla_X\, dl_{\theta_0}(X)\right] = g_{\theta_0}(X, X).$
Since g is symmetric, $g(X, Y) = \big(g(X+Y, X+Y) - g(X-Y, X-Y)\big)/4$, thus (35) characterizes g at $\theta_0$. Higher-order terms can be computed by repeatedly applying Proposition 5 and are expressed thanks to quantities of the form:
$\mathbb{E}\left[(i_X\, dl)\, \nabla_X^{(i)}\, dl(X)\right].$
An interesting case occurs when the Fisher metric is non-degenerate and $\nabla^{\mathrm{lc}}$ is its associated Levi-Civita connection. Normal coordinates at $\theta_0$, denoted by $x^i,\ i = 1 \dots N$, are given by taking an orthonormal basis with respect to the Fisher metric, $(v_1, \dots, v_N)$, and letting [28] (p. 72):
$x^i\left(\exp_{\theta_0}\left(t^j v_j\right)\right) = t^i$
Using the $x^i,\ i = 1 \dots N$ system of coordinates in place of θ, and noting that $\theta_0$ corresponds to the origin in normal coordinates, the KL divergence can be approximated at order 2 by:
$\mathrm{KL}\left(p_0, p_x\right) \simeq \frac{1}{2}\, \delta_{ij}\, x^i x^j$
where $x = (x^1, \dots, x^N)$.
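A minimal sanity check of this second-order approximation, using the one-parameter location family $p_\theta = \mathcal{N}(\theta, \sigma^2)$ (an illustrative choice, not from the paper), for which $g = 1/\sigma^2$ and the normal coordinate is simply $x = (\theta - \theta_0)/\sigma$:

```python
# Check of KL(p_0, p_x) ~ (1/2) delta_ij x^i x^j in normal coordinates for the location
# family p_theta = N(theta, sigma^2); the agreement is in fact exact for this family.
import numpy as np

sigma, theta0 = 2.0, 0.0

def kl_gaussian(m0, m1, s):
    """Closed-form KL(N(m0, s^2) || N(m1, s^2))."""
    return (m1 - m0) ** 2 / (2 * s ** 2)

for dtheta in (0.1, 0.05, 0.01):
    x = dtheta / sigma                          # normal coordinate of theta0 + dtheta
    print(kl_gaussian(theta0, theta0 + dtheta, sigma), 0.5 * x ** 2)
```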

2.2. The Fisher Information in Machine Learning

In machine learning applications, when the output is a probability distribution, the Kullback–Leibler divergence is a natural measure of goodness-of-fit. Assuming that the database is given in the form of an i.i.d. sample of couples $(X_i, Y_i),\ i = 1 \dots N$, one can introduce the error function:
$E(W) = \sum_{i=1}^N \mathrm{KL}\left(Y_i, N(X_i, W)\right)$
which may be approximated by:
$\tilde{E}(W) = \sum_{i=1}^N \frac{1}{2}\, g\left(Y_i;\ \overrightarrow{Y_i\, N(X_i, W)},\ \overrightarrow{Y_i\, N(X_i, W)}\right)$
where the notation $\overrightarrow{PQ}$ stands for the tangent vector at P such that the geodesic (for $\nabla^{\mathrm{lc}}$) θ with $\theta(0) = P$, $\theta'(0) = \overrightarrow{PQ}$ satisfies $\theta(1) = Q$. Taking the derivative with respect to W yields:
$\frac{\partial \tilde{E}}{\partial W} = \sum_{i=1}^N g\left(Y_i;\ \frac{\partial N(X_i, W)}{\partial W},\ \overrightarrow{Y_i\, N(X_i, W)}\right)$
with $\frac{\partial N(X_i, W)}{\partial W}$ being a tangent vector at $Y_i$.
We recall the musical isomorphism $\flat : TM \to T^*M$ defined by:
$X^\flat(Y) = g(X, Y)$
and use it to rewrite (41) as:
$\frac{\partial \tilde{E}}{\partial W} = \sum_{i=1}^N \left(\frac{\partial N(X_i, W)}{\partial W}\right)^{\!\flat}\left(\overrightarrow{Y_i\, N(X_i, W)}\right)$
In this form, having a critical point of the energy $\tilde{E}$ with respect to W is equivalent to the vanishing of a totally symmetric multilinear form on $TM \oplus T^*M$, the generalized tangent bundle of M.
Finally, if $\psi : N \to M$ is a smooth mapping, one can pull back the Fisher metric on M to obtain a semi-definite symmetric bilinear form on N:
$\psi^* g(\eta;\ X, Y) = g\left(\psi(\eta);\ d\psi_\eta\, X,\ d\psi_\eta\, Y\right)$
When ψ is an embedding, $\psi^* g$ is a Fisher metric on N with $p_{\psi(\eta)},\ \eta \in N$, as underlying densities. This is the case considered in [25].
As an example of a pullback metric, we are going to investigate the case of the von Mises–Fisher distribution (VMF) on $S^{n-1}$ with density:
$p_{\kappa, \mu}(x) = \frac{\kappa^{n/2 - 1}}{(2\pi)^{n/2}\, I_{n/2 - 1}(\kappa)}\, \exp\left(\kappa\, \langle x, \mu\rangle\right)$
where $\kappa \ge 0$ is the concentration parameter, $\mu \in S^{n-1}$ is the location parameter and $I_k$ is the modified Bessel function of the first kind of order k. The Fisher metric in the embedding space $\mathbb{R}^n$ can be deduced from the second moment $\mathbb{E}\left[x x^t\right]$ since $l_{\kappa,\mu} = \log p_{\kappa,\mu} = f(\kappa) + \kappa\, \langle x, \mu\rangle$. If κ is assumed to be constant, then:
$\mathbb{E}\left[\nabla_\mu\, l_{\kappa,\mu}\, \left(\nabla_\mu\, l_{\kappa,\mu}\right)^t\right] = \kappa^2\, \mathbb{E}\left[x\, x^t\right]$
Although the expression for $\mathbb{E}\left[x x^t\right]$ has been given in [29], we present here an alternative proof based on the fact that, for any integer n, $S^{n-1}$ is a suspension of $S^{n-2}$. If $x = (x_1, \dots, x_n)$, then $x x^t$ is a matrix whose $(i, j)$ entry is $x_i x_j$. By the rotation invariance of the VMF, μ can be selected as the first vector of an orthonormal basis, with respect to which x is expressed in components as $x = (x_1, \dots, x_n)$. If we specialize the first component, then, if $i \neq 1,\ j \neq 1$:
$\int_{S^{n-1}} x_i\, x_j\, p_{\kappa,\mu}(x)\, dx = c_\kappa \int_0^\pi e^{\kappa \cos\theta}\, \sin^2\theta\, \sin^{n-2}\theta \left(\int_{S^{n-2}} \xi_i\, \xi_j\, d\sigma_{n-2}(\xi)\right) d\theta$
with $x_i = \sin\theta\, \xi_i,\ i = 1 \dots n-1$, and $\sigma_{n-2}$ the Lebesgue measure on $S^{n-2}$. If $i \neq j$, then the integral vanishes by symmetry; otherwise:
$\int_{S^{n-2}} \xi_i\, \xi_j\, d\sigma_{n-2}(\xi) = \int_0^\pi \cos^2(\psi)\, \sin^{n-3}(\psi) \int_{S^{n-3}} d\sigma_{n-3}\, d\psi = \int_0^\pi \cos^2(\psi)\, \sin^{n-3}(\psi)\, d\psi\ A\!\left(S^{n-3}\right)$
with $A\!\left(S^{n-3}\right)$ the area of the $(n-3)$-sphere, which is given by the general relation:
$A\!\left(S^n\right) = \frac{2\, \pi^{\frac{n+1}{2}}}{\Gamma\!\left(\frac{n+1}{2}\right)}$
Now, observing that [30]:
$\int_0^\pi \cos^2(\psi)\, \sin^{n-3}(\psi)\, d\psi = B\!\left(\frac{3}{2}, \frac{n}{2} - 1\right)$
with B the beta function, the overall expression becomes, after using (49):
$\frac{2\, \pi^{n/2}\, \Gamma\!\left(\frac{n}{2} - 1\right)\, I_{n/2}(\kappa)\, \kappa^{n/2 - 1}}{\kappa^{n/2}\, \Gamma\!\left(\frac{n}{2} - 1\right)\, 2\, \pi^{n/2}\, I_{n/2 - 1}(\kappa)} = \frac{1}{\kappa}\, \frac{I_{n/2}(\kappa)}{I_{n/2 - 1}(\kappa)}$
When $i = j = 1$, the expression for the second moment becomes:
$c_\kappa \int_0^\pi e^{\kappa \cos\theta}\, \cos^2\theta\, \sin^{n-2}\theta\, d\theta\ A\!\left(S^{n-2}\right) = c_\kappa \int_0^\pi e^{\kappa \cos\theta}\left(1 - \sin^2\theta\right)\sin^{n-2}\theta\, d\theta\ A\!\left(S^{n-2}\right)$
The integral is a difference of two terms, each of which can be simplified as before to yield:
$1 - \frac{n-1}{\kappa}\, \frac{I_{n/2}(\kappa)}{I_{n/2 - 1}(\kappa)}$
This procedure can easily be applied to an arbitrary moment, each of the integrals involved being expressible using $I_n$ and the Beta function.
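The transverse second moment can be checked numerically against the closed form $(1/\kappa)\, I_{n/2}(\kappa)/I_{n/2-1}(\kappa)$ by evaluating the suspension decomposition with quadrature; the sketch below follows that decomposition, but the helper functions and parameter values are mine, not the paper's.

```python
# Check of the transverse VMF second moment E[x_i^2] (i orthogonal to mu) against
# (1/kappa) I_{n/2}(kappa)/I_{n/2-1}(kappa), using the suspension decomposition.
import numpy as np
from scipy.integrate import quad
from scipy.special import iv, gammaln

def sphere_area(d):
    """Surface area of the unit sphere S^d embedded in R^{d+1}."""
    return 2 * np.exp(0.5 * (d + 1) * np.log(np.pi) - gammaln(0.5 * (d + 1)))

def transverse_second_moment(n, kappa):
    c_kappa = kappa**(n / 2 - 1) / ((2 * np.pi)**(n / 2) * iv(n / 2 - 1, kappa))
    theta_int, _ = quad(lambda t: np.exp(kappa * np.cos(t)) * np.sin(t)**n, 0, np.pi)
    xi_int = sphere_area(n - 2) / (n - 1)       # integral of xi_i^2 over S^{n-2}
    return c_kappa * theta_int * xi_int

n, kappa = 5, 3.0
print(transverse_second_moment(n, kappa))
print(iv(n / 2, kappa) / (kappa * iv(n / 2 - 1, kappa)))   # closed form
```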
Remark 4.
Since μ is not a parameterization of the unit sphere, the Fisher metric defined that way is related to an ambient metric in R n , defined only on the unit sphere.
An obvious embedded submanifold of $S^{n-1}$ of dimension $n - 2$ is obtained by taking a unit vector ν and computing the intersection of $S^{n-1}$ with a hyperplane H defined by:
$x \in H \iff \langle x, \nu\rangle = \alpha, \quad \alpha \in [0, 1]$
An elementary computation proves that the intersection locus is an $(n-2)$-sphere contained in H:
$|x - \alpha\, \nu|^2 = 1 - \alpha^2$
Without loss of generality, ν can be taken as $(1, 0, \dots, 0)^t$ and the embedding can be written easily as:
$(x_1, \dots, x_{n-1}) \mapsto (\alpha, \lambda x_1, \dots, \lambda x_{n-1}), \quad \lambda = \sqrt{1 - \alpha^2}$
The pullback metric is just the original one scaled by 1 α 2 . The loss functions related to the VMF distribution are discussed in [31].

3. Pullback Bundles

In this section, a neural network with weights W is a mapping $N(\cdot, W) : I \to O$, where I (resp. O) is the input (resp. output) manifold, of dimension n (resp. m). Both manifolds are assumed to be smooth, as is the mapping $N_W$. This last assumption is valid when the activation functions are smooth, which is the case for sigmoid functions, but not for the commonly used ReLU function. However, smooth approximations of the ReLU are easy to construct with an arbitrary degree of accuracy, so the framework introduced below can still be applied.
As mentioned in the introduction, O is further assumed to be a statistical model with Fisher metric g. This setting is that of a neural network whose output is a random variable with conditional density in a family $p_\theta,\ \theta \in O$.
When the weights are kept fixed, the only free parameters are the inputs and the network is fully described by the mapping:
$N(\cdot, W) : I \to O, \quad x \mapsto N(x, W) = p_{\theta(x)}$
For ease of notation, the mapping $N(\cdot, W)$ will be abbreviated as $N_W(\cdot)$. When the activation functions in the network are smooth, $N_W(\cdot)$ is a smooth mapping and its derivative will be denoted by $dN_W$. With this convention, the pullback metric of g by $N_W(\cdot)$, denoted $\tilde{g}$, is defined by:
$\tilde{g}(X, Y) = g\left(dN_W\, X,\ dN_W\, Y\right)$
Unless the network N is a decoder, $\tilde{g}$ is generally degenerate and does not provide I with a Riemannian structure, so an ambient metric h on I is assumed to exist. The triple $(I, h, \tilde{g})$ is called the data manifold of the network. The kernel of $\tilde{g}$, denoted $\ker \tilde{g}$, is the distribution in $TI$ consisting of vectors X such that $\tilde{g}(X, \cdot)$ is the zero mapping. At a point $x \in I$, the vectors in $T_x I$ belonging to $\ker \tilde{g}$ give directions in which the output of the network will not change, up to order 1. Figure 1 represents the case of a one-dimensional output space and a 2-sphere input space. Since the dimension of the output is less than that of the input, some moves in the data manifold will not induce any change at the output.
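The pullback metric and its kernel are straightforward to compute for a concrete network once the Jacobian $dN_W$ is available. The sketch below does this for a small fixed-weight toy model with a Euclidean output metric; the architecture, the sizes and the metric choice are illustrative assumptions, not the paper's experimental setting.

```python
# Minimal sketch of the pullback metric g~ = dN_W^T G dN_W and of its kernel for a toy
# fixed-weight network (illustrative architecture, Euclidean output metric G = I).
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(5, 3), torch.nn.Tanh(), torch.nn.Linear(3, 2))

def pullback_metric(x, output_metric=None):
    J = torch.autograd.functional.jacobian(net, x)       # matrix of dN_W at x, shape (2, 5)
    G = torch.eye(J.shape[0]) if output_metric is None else output_metric
    return J.T @ G @ J                                    # g~(X, Y) = g(dN_W X, dN_W Y)

x = torch.randn(5)
g_tilde = pullback_metric(x)
eigvals, eigvecs = torch.linalg.eigh(g_tilde)             # ascending eigenvalues
print(eigvals)                                            # rank(g~) <= output dimension (2)
print(eigvecs[:, eigvals < 1e-8])                         # numerical basis of ker g~ at x
```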
Unless the dimension of $\ker \tilde{g}$ is constant, this distribution does not define a foliation. However, this is true locally in the neighborhood of points of I at which $dN_W(\cdot)$ has maximal rank. Finally, if $E \xrightarrow{\pi} O$ is a rank-r vector bundle on O, then its pullback by $N_W(\cdot)$ will be denoted in short by $E^{N_W}$. We recall that if E has local charts:
$(V_i, \xi_i), \quad \xi_i : V_i \times \mathbb{R}^r \to \pi^{-1}(V_i), \quad i \in I$
and I has local charts $(U_j, \phi_j),\ j \in J$, then $E^{N_W}$ has local charts:
$W_{ji} = U_j \cap N_W^{-1}(V_i), \quad \left(W_{ji}, \psi_{ji}\right), \quad \psi_{ji} : W_{ji} \times \mathbb{R}^r \to \pi^{-1}(W_{ji}), \quad \psi_{ji} = \xi_i \circ N_W \circ \phi_j$
The pullback bundle enjoys a universal property that is in fact the main reason for introducing it in our context.
Proposition 6.
Let $(\tilde{E}, \tilde{\pi}, I)$ (resp. $(E, \pi, O)$) be a vector bundle on I (resp. O). For any bundle morphism $(\eta_1, \eta_0)$, there exists a unique bundle morphism $(\tilde{\eta}_1, \mathrm{Id})$ such that the following diagram commutes:
[commutative diagram]
where $\pi_{\eta_0} : (x, v) \mapsto x$ and $\tilde{\eta}_0 : (x, v) \mapsto \left(\eta_0(x), v\right)$.
This proposition is a classical one and its proof can be found in many textbooks. The one we give below is very simple, using only local charts.
The construction above is explicit and thus gives a practical means of computation. For a network with fixed weights, e.g., a trained one, the derivative $dN_W$ can be efficiently computed by backpropagation, so the bundle morphism:
[bundle morphism diagram]
has a practical meaning.
Introducing the pullback bundle gives the diagram:
[pullback bundle diagram]
The bundle mapping $dN_W$ to $T^{N_W} O$ is then the association:
$(x, v) \in \mathbb{R}^n \times \mathbb{R}^n \mapsto \left(x,\ dN_W \cdot v\right) \in \mathbb{R}^n \times \mathbb{R}^m$
The pullback bundle is thus a means of representing the action of the network on tangent vectors to the data manifold. As an example, the construction of adversarial attacks given in [32,33] can be revisited in this context, extending it to the general setting of networks with manifold inputs.
The general problem of building an adversarial attack is, informally, to find, for an input point in the data manifold, a direction in which a perturbation will have the most important effect on the output, hopefully fooling the network. Following [33], we define:
Definition 3.
Let h be a Riemannian metric on the input space. An optimal adversarial attack at $x \in I$ with budget $\epsilon > 0$ is a solution to:
$\max_{v \in T_x I,\ h(v, v) \le \epsilon^2}\ \tilde{g}(v, v)$
Using (38), this optimization program can be viewed as a local approximation to the one based on the Kullback–Leibler divergence:
Definition 4.
A Kullback–Leibler optimal adversarial attack at $x \in I$ with budget $\epsilon > 0$ is a solution to:
$\max_{y \in I,\ d_h(x, y) \le \epsilon}\ \mathrm{KL}\left(N(x, W), N(y, W)\right)$
where $d_h$ denotes the geodesic distance associated with h.
The metric g on $TO$ can be pulled back to $T^{N_W} O$ by letting:
$g^{N_W}(x;\ v, v') = g\left(N_W(x);\ v, v'\right)$
Due to the special form of the criterion, the optimal point is on the boundary, so that finally, the optimal adversarial attack problem may be formulated as:
Definition 5.
An optimal adversarial attack at $x \in I$ with budget $\epsilon > 0$ is a solution to:
$\max_{v \in U T_x I}\ \epsilon^2\, g^{N_W}\left(dN_W\, v,\ dN_W\, v\right)$
where $U T I$ stands for the unit sphere bundle with respect to the metric h. Please note that, due to bilinearity, the problem can be solved for $\epsilon = 1$ and the optimal vector then scaled by the original ε. From standard linear algebra, if $G_x$ is the matrix of the bilinear form $g^{N_W}$ at x and $H_x$ the one of h, then one can find orthogonal matrices A, B and diagonal matrices Λ, Σ such that:
$H_x = A^t\, \Lambda\, A, \quad G_x = B^t\, \Sigma\, B$
Any vector v in $U T_x I$ can be written as:
$v = A^t\, \Lambda^{-1/2}\, w, \quad w^t w = 1$
so that, finally, the original problem can be rewritten as:
$\max_{w,\ w^t w = 1}\ w^t M^t M\, w, \quad M = \Sigma^{1/2}\, B\, d_x N_W\, A^t\, \Lambda^{-1/2}$
which is readily solved by taking w to be the unit eigenvector of $M^t M$ associated with the largest eigenvalue. This is the solution found in [33] when $H_x = \mathrm{Id}$.
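In matrix form, Definition 5 is a generalized symmetric eigenvalue problem $\tilde{G}\, v = \lambda\, H\, v$, which standard linear algebra routines solve directly. A minimal sketch, assuming a Euclidean ambient metric and a random stand-in for $d_x N_W$ (both placeholders, not the paper's data):

```python
# Sketch of the attack of Definition 5 in matrix form: maximize v^T G v subject to
# v^T H v = eps^2, where G is the pulled-back Fisher metric and H the ambient input metric.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, eps = 6, 0.1
H = np.eye(n)                                   # ambient metric h (Euclidean placeholder)
A = rng.standard_normal((2, n))                 # stand-in for d_x N_W (rank <= 2)
G_tilde = A.T @ A                               # degenerate pulled-back metric

# Generalized eigenproblem G v = lambda H v; eigh returns eigenvalues in ascending order,
# so the last eigenvector is the optimal attack direction.
eigvals, eigvecs = eigh(G_tilde, H)
v = eigvecs[:, -1]
v = eps * v / np.sqrt(v @ H @ v)                # rescale to the budget h(v, v) = eps^2
print(eigvals[-1], v)
```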
In many cases, as the above example indicates, it is more convenient to work purely in the input space, thus justifying the introduction of the pullback bundle $T^{N_W} O$. From now on, we adopt this point of view.
Remark 5.
Please note that a section of $T^{N_W} O$ is generally not related to a section of the form (63) in either $TO$ or $TI$, due to the fact that $d_x N_W$ may not be a monomorphism or an epimorphism. The next proposition gives conditions for the existence of global sections in $TO$ associated with global sections in $T^{N_W} O$.
Proposition 7.
In the case of a decoding network, when $N_W$ is an embedding, there is a natural embedding of bundles $TI \xrightarrow{\ i\ } T^{N_W} O$ such that the image of $(x, v)$ is $(x, dN_W\, v)$. The pullback bundle then splits as:
$T^{N_W} O = i(TI) \oplus F$
where F has rank $m - n$.
Be careful that, in this case, a section of the pullback bundle will not define a global section in $TO$, since some points of the output space may have no preimage by $N_W$. However, by the extension lemma [34] (Lemma 5.34, p. 115), local (global if $N_W(I)$ is closed) smooth vector fields on O exist, extending it.
Proof. 
If $N_W$ is an embedding, $N_W(I)$ is a submanifold of O and, in an adapted chart, a vector field on $N_W(I)$ can be written as $v = \sum_{i=1}^n v^i \partial_i$, where the $\partial_i,\ i = 1 \dots n$, are the first n coordinate vector fields. It thus pulls back to a section $\tilde{v}$ of the same form in $T^{N_W} O$. Now, since $dN_W$ is injective, $\tilde{v}$ is the image of a unique section of $TI$, hence the claim. □
Proposition 8.
If $dN_W$ has constant rank r, then there exist splittings $TI = \ker dN_W \oplus F$, $T^{N_W} O = \mathrm{im}\, dN_W \oplus G$ and a bundle isomorphism $F \to \mathrm{im}\, dN_W$ that coincides with $dN_W$ on the fibers.
Proof. 
By Theorem 10.34 of [34] (p. 266), $\ker dN_W$ is a subbundle of $TI$ and $\mathrm{im}\, dN_W$ a subbundle of $T^{N_W} O$. In local charts, the morphism $dN_W$ gives rise to the decomposition:
$\ker dN_W \oplus \mathbb{R}^{r} \xrightarrow{\ dN_W\ } \mathrm{im}\, dN_W \oplus \mathbb{R}^{m - r}$
with $dN_W$ an isomorphism when restricted to $\mathbb{R}^{r}$. Passing to local sections yields the result. □
An important case is that of submersions, corresponding to encoders in machine learning. In this case, $r = m$ and $dN_W$ establishes a bundle isomorphism between F and $T^{N_W} O$. The pullback of the Fisher–Rao metric g on $TO$ gives rise to a metric $g^{N_W}$ on $T^{N_W} O$, but only to a degenerate metric on $TI$, which can nevertheless be quite well understood, as indicated below.
Definition 6.
On the input bundle $TI$, the symmetric tensor $\tilde{g}$ is defined, using the splitting $TI = \ker dN_W \oplus F$, by:
$\tilde{g}(X, Y) = 0, \quad X \in \ker dN_W,\ Y \in TI$
$\tilde{g}(X, Y) = g^{N_W}\left(dN_W\, X,\ dN_W\, Y\right), \quad X, Y \in F$
Proposition 9.
There exists a symmetric (1,1)-tensor on I, denoted by Θ, such that, for any tangent vectors $X, Y \in TI$:
$h\left(\Theta X, Y\right) = \tilde{g}(X, Y)$
Proof. 
From standard linear algebra, there exists an adjoint ${}^t dN_W$ of $dN_W$, defined by:
$g^{N_W}\left(dN_W\, v,\ dN_W\, v'\right) = h\left({}^t dN_W\, dN_W\, v,\ v'\right)$
with, in local coordinates:
${}^t N^{i}_{j} = h^{il}\, N^{k}_{l}\, g^{N_W}_{kj}$
where N (resp. ${}^t N$) is the matrix associated with $dN_W$ (resp. ${}^t dN_W$) and, as usual, $h^{il} = (h^{-1})_{il}$. The (1,1)-tensor Θ is then the product ${}^t dN_W\, dN_W$. □
Remark 6.
Θ is defined even if d N W is not full rank.
Remark 7.
All the relevant information concerning d N W is encoded in Θ . As a consequence, the geometry of an encoder is described by this tensor, hence also the one of an encoder–decoder block.
Remark 8.
The tensor Θ has expression $g_{pj}\, N^{j}_{i}\, N^{p}_{k}$ in a local orthonormal frame, hence is symmetric.
Definition 7.
Let ∇ be a connection on $TI$. Its dual connection $\nabla^*$ is defined by the equation:
$Z\, h(X, Y) = h\left(\nabla_Z X,\ Y\right) + h\left(X,\ \nabla^*_Z Y\right)$
where Z is any tangent vector in T I and X , Y are vector fields.
Definition 8.
A (1,1)-tensor Θ is said to satisfy the gauge equation [35] if, for all tangent vectors Z:
$\nabla_Z \circ \Theta = \Theta \circ \nabla^*_Z$
Proposition 10.
If Θ satisfies the gauge Equation (78), then the (0,2)-tensor defined by:
$(X, Y) \mapsto h\left(\Theta X, Y\right)$
is parallel.
Proof. 
For any vector fields X , Y , and any tangent vector Z:
$Z\, h\left(\Theta X, Y\right) = h\left(\nabla_Z(\Theta X),\ Y\right) + h\left(\Theta X,\ \nabla^*_Z Y\right) = h\left(\Theta\, \nabla^*_Z X,\ Y\right) + h\left(\Theta X,\ \nabla^*_Z Y\right)$
hence the claim. □
Θ, being symmetric, admits a diagonal expression in a local orthonormal frame $(X_1, \dots, X_n)$. When there exists a connection ∇ such that $\nabla_Z(\Theta X) = \Theta\, \nabla_Z X$ for any vector fields X, Z, parallel transport of the $X_i,\ i = 1 \dots n$, shows that the eigenvalues are constant and the eigenspaces preserved. The existence of a solution to the gauge equation thus greatly simplifies the study of an encoder, as a local splitting of the input manifold exists. The reader is referred to [35] for more details. In fact, the tensor Θ is defined even for general networks, and the splitting may exist in this setting. This is the case when the rank of $dN_W$ is locally constant, hence when it is maximal. A practical computation of Θ can be obtained through the singular value decomposition, as Proposition 9 indicates. A numerical integration of the distribution given by the first singular vectors gives rise to a local system of coordinates, defining in turn a connection satisfying the gauge equation (the existence of a global solution has a cohomological obstruction that is outside the scope of this paper).
Finally, we introduce below a construction that takes into account the influence of the weights. As mentioned in Section 2, the derivative of the network with respect to its weights is adequately described as a 1-form, thus a section of $T^*O$. In fact, when the inner layers of the network are manifolds, the parameters are no longer real values and a suitable extension has to be introduced. One possible approach is to take a connection ∇ on the layer manifold L. Considering a point $p \in L$, the exponential map $\exp_p$ defines a local chart centered at p. Given a point q in the injectivity domain of $\exp_p$, one can obtain its coordinates as $\log_p q = \overrightarrow{pq}$ and the activation of a neuron with input q as $\alpha\left(\overrightarrow{pq}\right)$, with α a 1-form in $T^*L$. In this general setting, a manifold neuron will be defined by its input in an exponential chart, a 1-form corresponding to the weights of the Euclidean setting, and an activation function. Its free parameters are thus a couple $(q, \alpha) \in TL \oplus T^*L$. This particular vector bundle is known as the generalized tangent bundle.
Recalling (43), it is worth studying the pullback of the generalized bundle $TO \oplus T^*O$. The generalized pullback bundle is then $T^{N_W} O \oplus T^{*\,N_W} O$, whose local sections are generated by the pullback local sections of the form:
$x \mapsto \left(x,\ v\!\left(N_W(x)\right),\ \alpha\!\left(N_W(x)\right)\right)$
Please note that the pullback can be performed on any layer, internal or input. Most of the previous derivations can be carried out on the generalized bundle, which must thus be considered as a general, yet tractable, framework for XAI.

4. A Numerical Example

In this example, the input data are the handwritten digits from the MNIST database. A neural network with the following architecture was coded in PyTorch 2 and trained on the dataset:
  • First layer: convolutional, kernel size of 3, nonlinearity sigmoid;
  • Second layer: convolutional, kernel size of 3, nonlinearity sigmoid;
  • Pooling layer;
  • Two linear layers;
  • Softmax layer.
The input metric is Euclidean; the output one is the Fisher metric of the multinomial distribution with ten classes, which is given by the matrix:
$\mathrm{diag}\left(p_1^{-1}, p_2^{-1}, \dots, p_9^{-1}\right) + \frac{1}{p_{10}}\, \mathbb{1}\,\mathbb{1}^t$
where $\mathbb{1}$ denotes the all-ones vector and $p_{10} = 1 - \sum_{i=1}^{9} p_i$.
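For reference, this output metric can be assembled in a few lines; the helper below is a standalone illustration (names and test point are mine, not from the paper's code).

```python
# Standalone construction of the 9x9 output Fisher matrix diag(1/p_1,...,1/p_9) + (1/p_10) 11^T.
import numpy as np

def multinomial_fisher(p):
    """p: the first 9 class probabilities; the 10th is p10 = 1 - p.sum()."""
    p = np.asarray(p, dtype=float)
    p10 = 1.0 - p.sum()
    return np.diag(1.0 / p) + np.ones((p.size, p.size)) / p10

p = np.full(9, 0.1)                                    # uniform prediction over ten classes
G = multinomial_fisher(p)
print(G.shape, np.linalg.eigvalsh(G).min() > 0)        # (9, 9) and positive definite
```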
Since the output space has dimension 9, the fibers of the pullback bundle also have dimension 9. At an input point x, a point in the pullback bundle is a couple $(x, v)$ with v a vector of $\mathbb{R}^9$ attached at the output point $N_W(x)$. On the other hand, the image of the input tangent bundle (simply a vector space in our case) has points $(x, dN_W\, u)$ with u an input vector. We are thus considering the bundle mapping $(x, u) \mapsto (x, dN_W\, u)$, where the right-hand term has values in the pullback bundle, equipped with the output Fisher metric. The tensor Θ is computed via singular value decomposition, already implemented in torch. We selected the rotation rate of the singular vector associated with the largest singular value as an indicator of the complexity of the decision process in the neighborhood of an input point. The code was adapted from https://github.com/eliot-tron/CurvNetAttack (accessed on 12 September 2023). A detection of outliers from a sample of 1000 points was performed. A visual analysis reveals that they correspond to poorly drawn digits, as indicated in Figure 2, where the two digits with the highest curvature indicator are plotted.
The first one is labeled “9”, which is quite obvious for a human operator, although the final stroke is vertical, while the second is labeled “7”, easily confused with a “1”.
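For completeness, here is one possible reading of the rotation-rate indicator, assuming it measures the angle swept by the leading right-singular vector of $dN_W$ between an input and a nearby perturbed input; this interpretation, the perturbation scheme and all names below are assumptions of mine, not the paper's implementation (which, per the text, was adapted from the CurvNetAttack repository).

```python
# One possible reading of the rotation-rate indicator (an assumption, see above): the angle
# swept by the top right-singular vector of dN_W between x and a nearby perturbed input,
# divided by the perturbation size.
import torch

def leading_singular_vector(net, x):
    J = torch.autograd.functional.jacobian(net, x)      # dN_W at x
    _, _, Vh = torch.linalg.svd(J)                      # rows of Vh sorted by singular value
    return Vh[0]                                        # top right-singular vector (input space)

def rotation_rate(net, x, delta=1e-2):
    v0 = leading_singular_vector(net, x)
    v1 = leading_singular_vector(net, x + delta * torch.randn_like(x))
    cos = torch.abs(v0 @ v1).clamp(max=1.0)             # sign of a singular vector is arbitrary
    return torch.acos(cos) / delta                      # angle per unit input displacement

# Toy usage with an MNIST-sized multilayer perceptron (illustrative architecture only).
net = torch.nn.Sequential(torch.nn.Linear(784, 64), torch.nn.Sigmoid(), torch.nn.Linear(64, 10))
x = torch.randn(784)
print(rotation_rate(net, x))
```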

5. Conclusions and Future Work

In this paper, several important constructions originating from information geometry were surveyed and some new ones introduced. The pullback bundle on a layer makes it possible to describe the behavior of a network with respect to the Fisher information metric, and a simple description can be obtained when a gauge equation is satisfied. One important feature of this construction is its ability to fit into a general framework where layers take their inputs on a manifold.
Future work involves a companion paper describing computational procedures and examples from real case studies. A study of the properties of the pullback generalized bundle is also in progress. Finally, the case of networks with non-constant rank $dN_W$ must be considered. It is believed that they give rise to singular foliations.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 2021, 23, 18. [Google Scholar] [CrossRef]
  2. Chamola, V.; Hassija, V.; Sulthana, A.R.; Ghosh, D.; Dhingra, D.; Sikdar, B. A Review of Trustworthy and Explainable Artificial Intelligence (XAI). IEEE Access 2023, 11, 78994–79015. [Google Scholar] [CrossRef]
  3. Chang, D.T. Probabilistic Deep Learning with Probabilistic Neural Networks and Deep Probabilistic Models. arXiv 2021, arXiv:cs.LG/2106.00120. [Google Scholar]
  4. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  5. Alicioglu, G.; Sun, B. A survey of visual analytics for Explainable Artificial Intelligence methods. Comput. Graph. 2022, 102, 502–520. [Google Scholar] [CrossRef]
  6. Fawzi, A.; Moosavi-Dezfooli, S.M.; Frossard, P. The Robustness of Deep Networks: A Geometrical Perspective. IEEE Signal Process. Mag. 2017, 34, 50–62. [Google Scholar] [CrossRef]
  7. Fawzi, A.; Fawzi, O.; Frossard, P. Analysis of classifiers’ robustness to adversarial perturbations. Mach. Learn. 2015, 107, 481–508. [Google Scholar] [CrossRef]
  8. Wong, E.; Kolter, Z. Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Proceedings of Machine Learning Research. PMLR: London, UK, 2018; Volume 80, pp. 5286–5295. [Google Scholar]
  9. Raghunathan, A.; Steinhardt, J.; Liang, P. Certified Defenses against Adversarial Examples. arXiv 2018, arXiv:1801.09344. [Google Scholar]
  10. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
  11. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  12. Moosavi-Dezfooli, S.M.; Fawzi, A.; Frossard, P. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2574–2582. [Google Scholar]
  13. Abdollahpourrostam, A.; Abroshan, M.; Moosavi-Dezfooli, S.M. Revisiting DeepFool: Generalization and improvement. arXiv 2023, arXiv:2303.12481. [Google Scholar]
  14. Fefferman, C.; Mitter, S.K.; Narayanan, H. Testing the Manifold Hypothesis. arXiv 2013, arXiv:1310.0425. [Google Scholar] [CrossRef]
  15. Narayanan, H.; Mitter, S.K. Sample Complexity of Testing the Manifold Hypothesis. In Proceedings of the NIPS, Vancouver, BC, Canada, 6–9 December 2010. [Google Scholar]
  16. Grementieri, L.; Fioresi, R. Model-centric Data Manifold: The Data Through the Eyes of the Model. arXiv 2021, arXiv:2104.13289. [Google Scholar] [CrossRef]
  17. Ye, J.C.; Sung, W.K. Understanding Geometry of Encoder-Decoder CNNs. arXiv 2019, arXiv:1901.07647. [Google Scholar]
  18. Zhang, Z.; Yu, W.; Zhu, C.; Jiang, M. A Unified Encoder-Decoder Framework with Entity Memory. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
  19. Zhang, H.; Li, S.; Chen, Y.; Dai, J.; Yi, Y. A Novel Encoder-Decoder Model for Multivariate Time Series Forecasting. Comput. Intell. Neurosci. 2022, 2022, 5596676. [Google Scholar] [CrossRef]
  20. Ju, C.; Guan, C. Deep Optimal Transport for Domain Adaptation on SPD Manifolds. arXiv 2022, arXiv:2201.05745. [Google Scholar]
  21. Santos, S.; Ekal, M.; Ventura, R. Symplectic Momentum Neural Networks—Using Discrete Variational Mechanics as a prior in Deep Learning. In Proceedings of the Conference on Learning for Dynamics & Control, Stanford, CA, USA, 23–24 June 2022. [Google Scholar]
  22. Karakida, R.; Okada, M.; Amari, S. Adaptive Natural Gradient Learning Based on Riemannian Metric of Score Matching. 2016. Available online: https://openreview.net/pdf?id=lx9lNjDDvU2OVPy8CvGJ (accessed on 27 July 2023).
  23. Amari, S.; Karakida, R.; Oizumi, M. Fisher Information and Natural Gradient Learning of Random Deep Networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018. [Google Scholar]
  24. Karakida, R.; Akaho, S.; Amari, S. Universal statistics of Fisher information in deep neural networks: Mean field approach. J. Stat. Mech. Theory Exp. 2020, 2020, 124005. [Google Scholar] [CrossRef]
  25. Arvanitidis, G.; González-Duque, M.; Pouplin, A.; Kalatzis, D.; Hauberg, S. Pulling back information geometry. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; Camps-Valls, G., Ruiz, F.J.R., Valera, I., Eds.; Proceedings of Machine Learning Research. PMLR: London, UK, 2022; Volume 151, pp. 4872–4894. [Google Scholar]
  26. Rao, C.R. Information and the Accuracy Attainable in the Estimation of Statistical Parameters. In Breakthroughs in Statistics: Foundations and Basic Theory; Springer: New York, NY, USA, 1992; pp. 235–247. [Google Scholar] [CrossRef]
  27. Amari, S.; Nagaoka, H. Methods of Information Geometry; Fields Institute Communications, American Mathematical Society: Toronto, ON, Canada, 2000. [Google Scholar]
  28. Willmore, T. Riemannian Geometry; Oxford Science Publications, Oxford University Press: Oxford, UK, 1996. [Google Scholar]
  29. Kitagawa, T.; Rowley, J. von Mises-Fisher distributions and their statistical divergence. arXiv 2022, arXiv:2202.05192. [Google Scholar]
  30. Olver, F.W.J.; Olde Daalhuis, A.B.; Lozier, D.W.; Schneider, B.I.; Boisvert, R.F.; Clark, C.W.; Miller, B.R.; Saunders, B.V.; Cohl, H.S.; McClain, M.A. (Eds.) NIST Digital Library of Mathematical Functions; Release 1.1.10 of 2023-06-15. Available online: https://dlmf.nist.gov/ (accessed on 27 July 2023).
  31. Scott, T.R.; Gallagher, A.C.; Mozer, M.C. von Mises-Fisher Loss: An Exploration of Embedding Geometries for Supervised Learning. arXiv 2021, arXiv:cs.LG/2103.15718. [Google Scholar]
  32. Martin, J.; Elster, C. Inspecting adversarial examples using the fisher information. Neurocomputing 2020, 382, 80–86. [Google Scholar] [CrossRef]
  33. Zhao, C.; Fletcher, P.T.; Yu, M.; Peng, Y.; Zhang, G.; Shen, C. The adversarial attack and detection under the fisher information metric. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5869–5876. [Google Scholar]
  34. Lee, J. Introduction to Smooth Manifolds; Graduate Texts in Mathematics; Springer: New York, NY, USA, 2012. [Google Scholar]
  35. Boyom, M.N. Foliations-Webs-Hessian Geometry-Information Geometry-Entropy and Cohomology. Entropy 2016, 18, 433. [Google Scholar] [CrossRef]
Figure 1. Kernel of the pullback metric.
Figure 2. Samples with the highest rotation rate.