Information Field Theory and Artificial Intelligence

Information field theory (IFT), the information theory for fields, is a mathematical framework for signal reconstruction and non-parametric inverse problems. Artificial intelligence (AI) and machine learning (ML) aim at generating intelligent systems, including systems for perception, cognition, and learning. This overlaps with IFT, which is designed to address perception, reasoning, and inference tasks. Here, the relation between concepts and tools in IFT and those in AI and ML research is discussed. In the context of IFT, fields denote physical quantities that change continuously as a function of space (and time), and information theory refers to Bayesian probabilistic logic equipped with the associated entropic information measures. Reconstructing a signal with IFT is a computational problem similar to training a generative neural network (GNN) in ML. In this paper, the process of inference in IFT is reformulated in terms of GNN training. In contrast to classical neural networks, IFT-based GNNs can operate without pre-training thanks to the expert knowledge incorporated into their architecture. Furthermore, the cross-fertilization of variational inference methods used in IFT and ML is discussed. These discussions suggest that IFT is well suited to address many problems in AI and ML research and application.


Motivation
Determining the concrete configuration of a field from measurement data is an ill-posed inverse problem, as physical fields have an infinite number of degrees of freedom (DoF), whereas data sets are always finite in size. Thus, the data provide a finite number of constraints for only a subset of the infinitely many DoF of a field. In order to infer a field, its remaining DoF therefore need to be constrained via prior information. Fortunately, physics provides such prior information on fields. This information might either be precise, like ∇ · B = 0 in electrodynamics, or more phenomenological, in the sense that a field shaped by a certain process can often be characterized by its n-point correlation functions. Knowledge of such correlations can be sufficient to regularize the otherwise ill-posed field inference problem from finite and noisy data such that meaningful statements about the field can be made.
The motivation for this work is twofold. On the one hand, understanding conceptual relations between IFT, ML, and AI techniques allows us to transfer computational methods between these domains and to develop synergistic approaches. This article discusses such transfers. On the other hand, the current success of deep learning techniques for neural networks has made them appear as a synonym for AI in the public perception. This has consequences for decisions about which kinds of technologies receive scientific funding. The point this paper tries to make is that if deep learning qualifies as AI in this respect, then this should also apply to a number of other techniques, including those based on IFT.
The paper is organized as follows. IFT is briefly introduced in Section 2 in its most modern incarnation in terms of standardized, generative models. These are shown to be structurally similar to GNNs in Section 3. The structural similarity of IFT inference and GNN training problems allows for a common set of variational inference methods, as discussed in Section 4. Section 5 concludes on the relation of IFT methods and those used in AI and ML and gives an outlook on future synergies.

Basics
IFT allows us to deduce fields from data in a probabilistic way. In order to be able to apply probability theory onto the space of field configurations, a measure in this space is needed. Although no canonical mathematical measure on function spaces exists, for IFT applications, the usage of Gaussian process measures [33], which are mathematically well defined [34,35], is usually fully sufficient. Gaussian processes can also be argued to be a natural starting point for reasoning on fields with known finite first and second order moments, as we will discuss now.
To be specific, let $\phi : \Omega \to \mathbb{R}$ be a scalar field over some domain $\Omega \subset \mathbb{R}^u$ and let our prior knowledge $I$ on $\phi$ be the first and second moments of the field, e.g., $\langle \phi_x \rangle_{(\phi|I)} = 0$ and $\langle \phi_x \phi_y \rangle_{(\phi|I)} = \Phi_{xy}$, with $\phi_x := \phi(x)$ denoting a field value and $\langle f(\phi) \rangle_{(\phi)} := \int \mathcal{D}\phi\, P(\phi)\, f(\phi)$ a prior expectation value for some function $f$ of the field. If only the first and second field moments are given as prior information, it follows from the maximum entropy principle that the least informative probability distribution function encoding this information is a Gaussian with these moments, $P(\phi|I) = \mathcal{G}(\phi, \Phi)$. Thus, using this Gaussian as a prior given the background information $I$ is a conservative choice, as it makes the fewest additional assumptions about the field beyond the moments specified in $I$.
In many applications, however, the field of interest, the signal $s$, is not a Gaussian field, but may be related to one via a non-linear transformation. For example, in astronomical applications of IFT, the sky brightness field $s$ is the quantity of interest, which is strictly positive and therefore cannot be a Gaussian field. However, the logarithm of a brightness can be positive or negative and may therefore be modeled as a Gaussian process. In such a case, one could assign, e.g., $s_x(\phi) = s_0 \exp(\phi_x)$ as a model for a diffuse (spatially correlated) sky emission component, with $s_0$ a reference brightness, chosen such that, for example, $\langle \phi_x \rangle_{(\phi)} = 0$ holds.
Having established a field prior, Bayesian reasoning on the field $\phi$, and therefore on the signal of interest $s = s(\phi)$, based on some data $d$ and its likelihood $P(d|\phi, I)$ is possible. The field posterior $P(\phi|d, I) = P(d|\phi, I)\, P(\phi|I) / P(d|I)$ is as well defined as the prior and permits us to answer questions about the field, like its most probable configuration $\phi_{\mathrm{MAP}} = \mathrm{argmax}_\phi\, P(\phi|d, I)$ (MAP = maximum a posteriori), its posterior mean $m = \langle \phi \rangle_{(\phi|d,I)}$, or its posterior uncertainty dispersion $D = \langle (\phi - m)(\phi - m)^\dagger \rangle_{(\phi|d,I)}$. IFT exploits the formalism of quantum and statistical field theory to calculate such posterior expectation values [1,28,36,37,38]. These formal calculations, however, are not the focus here. Instead, the focus is on the formulation of IFT inference problems in terms of generative models, as these can be interpreted as GNNs.
For this purpose, the likelihood is expressed in terms of a measurement equation, $d = R(\phi) + n$, which is always possible if the data can be embedded into a vector space and the data expectation value $\langle d \rangle_{(d|\phi)}$ exists. Here and in the following, we omit the background information $I$ in probabilities. This rewriting of the likelihood in terms of a mean instrument response $\bar{d} = R(\phi) := \langle d \rangle_{(d|\phi)}$ to the field $\phi$ and a noise process $P(n|\phi)$, which summarizes the fluctuations around that mean $\bar{d}$, allows us to regard the data as the result of a noisy generative process that maps field values $\phi$ and associated noise realizations $n$ onto data $d$ according to Equation (5).
In case the instrument response and noise processes are provided for the signal $s$ instead of the Gaussian field $\phi$ as $R'(s) := \langle d \rangle_{(d|s)}$ and $P(n|s)$, their respective pullbacks $R(\phi) := \langle d \rangle_{(d|s(\phi))} = R'(s(\phi))$ and $P(n|\phi) := P(n|s(\phi))$ provide the necessary response and noise statistics w.r.t. the field $\phi$.
All this provides a generative model for the signal $s$ and data $d$ via $\phi \leftarrow \mathcal{G}(\phi, \Phi)$, $s = s(\phi)$, $n \leftarrow P(n|s)$, and $d = R'(s) + n$, which should now be standardized. The standardization introduces a generic latent space that permits better comparison to GNNs used in AI and ML and simplifies the usage of the variational inference methods discussed later on.

Prior Standardization
Standardization of a random variable $\phi$ refers to finding a mapping from a standard normal distributed random variable $\xi \leftarrow \mathcal{G}(\xi, 1)$ to $\phi$ that reproduces the statistics of $P(\phi)$. For a Gaussian field $\phi$, this is just a mapping of the form $\phi = \Phi^{1/2} \xi$, where $\Phi^{1/2}$ refers to a square root of $\Phi$ with $\Phi^{1/2} (\Phi^{1/2})^\dagger = \Phi$, which always exists for a positive definite covariance matrix. For the large class of band-diagonal, translationally invariant covariance matrices $\Phi$, which are very relevant for applications as we argue below, the square root of $\Phi$ can be explicitly constructed.

Power Spectra
In many signal inference problems, no spatial location is singled out a priori, before the measurement. This means that the field covariance between two locations depends only on the separation between these positions, but not on their absolute positions. Thus, $\Phi_{xy} = C_\phi(x - y)$. As a consequence of the Wiener-Khinchin theorem, such a translationally invariant field covariance becomes diagonal in harmonic space, $\Phi_{kq} = (F \Phi F^\dagger)_{kq} = (2\pi)^u\, \delta(k - q)\, P_\phi(k)$. Here and in the following, $F$ denotes a harmonic transform (a $u$-dimensional Fourier transform $F_{kx} = \exp(i k \cdot x)$ in case of a Euclidean space, as we assume in the following), $\dagger$ the adjoint (complex conjugate and transpose of a matrix or vector), $P_\phi(k) := F_{kx} C_\phi^x$ is the so-called power spectrum of $\phi$, the Einstein convention for repeated indices is used, as in $\phi_k := F_{kx} \phi^x \equiv \int d^u x\, \exp(i k \cdot x)\, \phi(x)$, and $\hat{\varphi} = \mathrm{diag}(\varphi)$ denotes a diagonal operator in the space of the field $\varphi$ with the values of $\varphi$ on the diagonal.
Thanks to this diagonal representation of the field covariance in harmonic space, an explicit standardization of the field is given via $\phi = A\, \xi$ with $A := F^{-1} \widehat{P_\phi^{1/2}}$, where the latter is an amplitude operator that is diagonal in harmonic space and that imprints the right amplitudes onto the Fourier modes of $\phi$. This can be seen via a direct calculation, $\langle \phi \phi^\dagger \rangle_{(\xi)} = A \langle \xi \xi^\dagger \rangle_{(\xi)} A^\dagger = A A^\dagger = \Phi$. In case no direction of the space is singled out a priori, the two-point correlation function and the power spectrum of $\phi$ become isotropic, $\Phi_{xy} = C_\phi(|x - y|)$ and $\Phi_{kq} = (2\pi)^u\, \delta(k - q)\, P_\phi(|k|)$, respectively. In this case, only a one-dimensional power spectrum needs to be known. Such power spectra are often smooth functions on a double logarithmic scale in Fourier space, since any sharp feature in them would correspond to a (quasi-)periodic pattern in position space, which would be very unnatural for most signals. Thus, introducing the logarithmic Fourier space scale variable $\kappa(k) := \ln(k/k_0)$ w.r.t. some reference scale $k_0$, we expect $\psi(\kappa) := \ln[P_\phi(k_0 e^{\kappa}) / P_0]$ to be a field itself, in the sense that it is sufficiently smooth. Here, $P_0$ is a pivot scale for the power spectrum.
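As a hedged illustration of this standardization, the following minimal sketch applies a discrete stand-in for the amplitude operator to white excitations on a periodic one-dimensional grid. The function names, the particular falling spectrum, and the grid are assumptions made for illustration only and do not appear in the text. Continuum volume factors such as $(2\pi)^u$ are ignored, and the excitations are supplied in position space, which for white noise is statistically equivalent up to normalization.

```python
import numpy as np

def power_spectrum(k, p0=1.0, k0=1.0, slope=-3.0):
    """Illustrative (assumed) smooth, falling spectrum P(k)."""
    return p0 * (1.0 + (np.abs(k) / k0) ** 2) ** (slope / 2.0)

def apply_amplitude(xi, length=1.0):
    """Map white excitations xi to a correlated field phi = A xi."""
    n = xi.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)  # grid wave numbers
    amplitudes = np.sqrt(power_spectrum(k))            # diagonal in harmonic space
    # P(k) is even in k, so the result is real up to rounding errors.
    return np.fft.ifft(amplitudes * np.fft.fft(xi)).real

rng = np.random.default_rng(0)
xi = rng.standard_normal(128)   # xi <- G(xi, 1)
phi = apply_amplitude(xi)       # one prior sample of the field
```

In this discrete setting the implied covariance is a circulant matrix whose eigenvalues are the values of the assumed spectrum on the grid wave numbers, which mirrors the statement that a translationally invariant covariance is diagonal in harmonic space.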

Amplitude Model
Often, the power spectrum as parameterized through $\psi$ is not known a priori for a field $\phi$, but statistical homogeneity, isotropy, and the absence of long-range quasi-periodic signal variations make a Gaussian field prior for $\psi$ plausible, $P(\psi) = \mathcal{G}(\psi - \bar{\psi}, \Psi)$. This log-log power spectrum may exhibit fluctuations $\chi := \psi - \bar{\psi}$ around a non-zero mean $\bar{\psi}(\kappa)$. The latter might, e.g., encode a preference for falling spectra and therefore for a spatially smooth field $\phi$. In this case, just another layer for $\chi$ has to be added to the standardized generative model, $\phi = A_\phi(\psi)\, \xi$ with $\psi = \bar{\psi} + \chi$ and $\chi = A_\psi \eta$, where $\eta \leftarrow \mathcal{G}(\eta, 1)$ and $A_\psi A_\psi^\dagger = \Psi$. Again, a prior for a field, here the only one-dimensional $\chi(\kappa)$, is needed. A detailed description of how this amplitude model can be provided efficiently is given by [15]. This reference also provides a generative model for the case that the signal domain $\Omega$ is a product of sub-spaces, like position space and an energy spectrum coordinate, each requiring a different correlation structure, with the total correlation being a direct product of those. Assuming a direct product for the correlation structures might be possible for many field inference problems [15,39].

Dynamical Systems
Let us take a brief detour to fields shaped by dynamical systems. Dynamical systems typically exhibit correlation structures that are not direct products of the spatial and temporal sub-spaces, as was assumed above. Here, the full spatial and temporal Fourier power spectrum $P_\phi(k, \omega)$, with $\omega$ being the temporal frequency, encodes the full dynamics of a linear, homogeneous, and autonomous system. For example, a stochastic wave field $\phi(x, t)$ may follow the dynamical equation $\partial_t^2 \phi + \eta\, \partial_t \phi - c^2 \nabla^2 \phi = \xi$, where $c$ is the wave velocity and $\eta$ a damping constant. The field dynamics are determined by a response operator (or Green's function) $G$ that is a convolution of the exciting noise field $\xi$ with a kernel $g$, $\phi = G\, \xi = g * \xi$, where $*$ denotes convolution. In Fourier space, this kernel can be applied by a direct point-wise multiplication, $(F\phi)(k, \omega) = (F g)(k, \omega)\, (F \xi)(k, \omega)$, and is given by the inverse of the Fourier-space representation of the differential operator of the dynamical equation. If the excitation of field fluctuations is caused by a white, stochastic noise field $\xi \leftarrow \mathcal{G}(\xi, 1)$, the resulting field has a power spectrum of $P_\phi(k, \omega) = |(F g)(k, \omega)|^2 = [(\omega^2 - c^2 k^2)^2 + \eta^2 \omega^2]^{-1}$. In this case, the spectrum is an analytical function in $\omega$ and $k$. This results from Equation (20) being a linear, homogeneous, and autonomous partial differential equation. Linear integro-differential equations can still be solved by convolutions as long as they are homogeneous and autonomous, although the kernel might then no longer have an analytically closed form. For example, in neural field theory [40][41][42][43], a macroscopic description of the brain cortex dynamics, the neural activity $\phi(x, t)$ might be described by an integro-differential equation in which $w$ is a spatio-temporal convolution kernel (that usually contains a delta function in time), $f : \mathbb{R} \to \mathbb{R}$ is an activation function that is applied point-wise to the field, $(f \circ \phi)(x, t) = f(\phi(x, t))$, and $\xi$ is an added input term. In case $f$ is linear, the system responds linearly to inputs. Then, the input response is a convolution with a kernel $g$ that in general has a non-analytical spectrum, which is determined by the slope $f'$ of $f$ and the Fourier-transformed kernel $F w$ of the dynamics.
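The spectrum of the stochastic wave field can be sketched numerically. The snippet below assumes that the dynamical equation is the damped, driven wave equation $\partial_t^2 \phi + \eta\, \partial_t \phi - c^2 \nabla^2 \phi = \xi$ and adopts the plane-wave convention $\exp(i(k \cdot x - \omega t))$. Both are assumptions of this sketch, as are the function names and parameter values.

```python
import numpy as np

def wave_kernel(k, omega, c=1.0, eta=0.1):
    """Fourier-space response kernel, assuming the convention exp(i(k x - omega t))."""
    return 1.0 / (c**2 * k**2 - omega**2 - 1j * eta * omega)

def wave_power_spectrum(k, omega, c=1.0, eta=0.1):
    """Power spectrum of the field for white excitation: |kernel|^2."""
    return 1.0 / ((omega**2 - c**2 * k**2) ** 2 + (eta * omega) ** 2)

k = np.linspace(0.1, 5.0, 50)[:, None]       # spatial wave numbers
omega = np.linspace(0.1, 5.0, 60)[None, :]   # temporal frequencies
spectrum = wave_power_spectrum(k, omega)     # array of shape (50, 60)
```

The resulting spectrum is concentrated along the dispersion relation $\omega = c\,|k|$, with a width set by the damping constant. It is therefore not a direct product of a spatial and a temporal spectrum.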
An inference of such non-analytical and highly structured response spectra from data is possible with IFT and can be used to learn the system dynamics from noisy system measurements [26,44]. It just requires a more complex spectral prior than discussed here. Let us now return to our main line of argumentation.

Generative Model
To summarize, the field inference problems of IFT can often be stated in terms of a standardized, generative model for the signal and the data. For the illustrative case outlined above, where the probabilistic model is given by $P(\psi) = \mathcal{G}(\psi - \bar{\psi}, \Psi)$, $P(\phi|\psi) = \mathcal{G}(\phi, \Phi(\psi))$, and the likelihood $P(d|\phi)$, the corresponding standardized generative model is $\xi \leftarrow \mathcal{G}(\xi, 1)$, $\eta \leftarrow \mathcal{G}(\eta, 1)$, $\psi = \bar{\psi} + A_\psi \eta$, $\phi = A_\phi(\psi)\, \xi$, $s = s_0 \exp(\phi)$, $n \leftarrow P(n|s)$, and $d = R'(s) + n$. This generative model is illustrated in Figure 1. Variants of it are used in a number of real-world data applications [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. Its performance in generative and reconstruction mode is illustrated for synthetic data in Figures 2 and 3. For the noiseless data $\bar{d} = R'(s)$, the generative model reads $\bar{d}(\zeta) = R'(s(\phi(\xi, \psi(\eta))))$ with the latent variable vector $\zeta := (\xi, \eta)$. This way, the full model complexity as given by Equations (26)-(31) is transferred into an effective response function $\bar{d} = R' \circ s \circ \phi$, regarded as a function of $\zeta$. For this latent variable vector, the prior is simply $P(\zeta) = \mathcal{G}(\zeta, 1)$, whereas the likelihood $P(d|\zeta) = P(n = d - \bar{d}(\zeta)|\zeta)$ has absorbed the full model complexity. This so-called reparametrization trick [45] was introduced to IFT by [29] to simplify numerical variational inference. At this point, it is essential to realize that this generative model consists of a latent space white noise process $P(\zeta) = \mathcal{G}(\zeta, 1)$ that generates an input vector $\zeta$ and a sequence of non-local linear and local non-linear operations that is applied to it. The Fourier transform $F^{-1}$ and $A_\psi$ are examples of non-local linear operations within the model. Among the non-linear operations are the exponential functions and the application of the $\psi$-dependent amplitude operator $A_\phi(\psi)$ to the latent space excitations $\xi$, as there the two components of $\zeta = (\xi, \eta)$ are multiplied together.
Furthermore, the instrument response $R'(s)$ might also be decomposed into sequences of non-local linear and local non-linear operations, as physical processes in measurement devices can often be cast into the propagation of a quantity (an operation that is linear in the quantity) and the local interactions of the quantity (an operation non-linear in it), respectively. Software that supports the implementation and inference of IFT models is described in [46][47][48].
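The chain of non-local linear and local non-linear operations summarized above can be sketched as a single function of the latent vector $\zeta = (\xi, \eta)$. The following is a toy illustration under stated assumptions, not the model used for the figures: the mean log-spectrum, the two-parameter linear layer for $\chi$, the pixel-subsampling response, and all names are hypothetical, and discrete volume factors are ignored.

```python
import numpy as np

def generative_model(xi, eta, s0=1.0, length=1.0):
    """Noiseless data d_bar(zeta) for zeta = (xi, eta); all modelling choices are assumed."""
    n = xi.size
    k_abs = np.abs(2.0 * np.pi * np.fft.fftfreq(n, d=length / n))
    k0 = k_abs[1]                                 # reference scale: smallest non-zero |k|
    kappa = np.log(np.maximum(k_abs, k0) / k0)    # logarithmic scale (zero mode mapped to 0)
    psi_mean = -3.0 * kappa                       # preference for falling spectra
    chi = 0.5 * eta[0] + 0.3 * eta[1] * kappa     # toy linear layer chi = A_psi eta
    psi = psi_mean + chi                          # log-log power spectrum
    amplitudes = np.exp(0.5 * psi)                # sqrt(P) with P = exp(psi), P_0 = 1
    phi = np.fft.ifft(amplitudes * np.fft.fft(xi)).real  # phi = A_phi(psi) xi
    s = s0 * np.exp(phi)                          # strictly positive signal
    d_bar = s[::2]                                # toy response R': every second pixel
    return phi, s, d_bar

rng = np.random.default_rng(1)
xi, eta = rng.standard_normal(64), rng.standard_normal(2)
phi, s, d_bar = generative_model(xi, eta)
noise = 0.1 * rng.standard_normal(d_bar.size)     # n <- P(n|s), here white Gaussian
d = d_bar + noise                                 # synthetic data d = R'(s) + n
```

The product of the amplitudes (which depend on `eta`) with the transformed excitations `xi` is the place where the two components of the latent vector are multiplied together.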

Neural Networks
AI and ML are vast fields. AI aims at building artificial cognitive systems that perceive their environment, reason about its state and the systems' best actions, and learn to improve their performance. ML can be regarded as a sub-field of AI, embracing many different methods like self-organized maps, Gaussian mixture models, deep neural networks, and many others. Here, the focus should be on specific neural networks, GNNs, as those have a close relation to the generative IFT models introduced before.
GNNs transform a latent space variable $\xi \leftarrow \mathcal{G}(\xi, 1)$ into a signal or data realization, $s = s(\xi)$ or $\bar{d} = \bar{d}(\xi)$. A neural network is a function $g(\xi)$ that can be decomposed in terms of $n$ layer processing functions $g_i$ with $g = g_n \circ g_{n-1} \circ \ldots \circ g_1$.
Any of the layer processing functions $g_i : \xi_i \mapsto \xi_{i+1}$ with $\xi_1 \equiv \xi$ typically consists of a non-local, affine linear transformation $l_i(\xi_i) := L_i \xi_i + b_i$ of the input vector $\xi_i$ of layer $i$, followed by a local, point-wise application of non-linear, so-called activation functions $\sigma_i : \mathbb{R} \to \mathbb{R}$. Thus, the output vector $\xi_{i+1}$ of layer $i$ is $\xi_{i+1} = g_i(\xi_i) = \sigma_i(L_i \xi_i + b_i)$, where $\sigma_i$ acts component-wise. The set $\eta = (L_i, b_i)_{i=1}^n$ of all coefficients of the $l_i$ (the matrix elements of the $L_i$ matrices and the components of the $b_i$ vectors) determines the function the network represents. Putting the input values and network coefficients into a single vector $\zeta := (\xi, \eta)$, a GNN can be regarded as a function of both latent variables $\xi$ and network parameters $\eta$, $\bar{d}(\zeta) = g(\xi; \eta)$.
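The layer structure just described can be written down in a few lines. This is a minimal sketch: the function name `gnn_forward`, the layer sizes, the random weights, and the choice of `tanh` as activation are illustrative assumptions, not part of the text.

```python
import numpy as np

def gnn_forward(xi, params, activation=np.tanh):
    """Evaluate g = g_n o ... o g_1 with g_i(x) = activation(L_i x + b_i)."""
    out = xi
    for L, b in params:          # params plays the role of eta = (L_i, b_i)_i
        out = activation(L @ out + b)
    return out

rng = np.random.default_rng(2)
sizes = [4, 16, 32]              # latent dimension 4, output dimension 32
params = [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
xi = rng.standard_normal(sizes[0])   # xi <- G(xi, 1)
d_bar = gnn_forward(xi, params)      # d_bar(zeta) = g(xi; eta)
```

Here a low-dimensional latent vector is mapped into a higher-dimensional output space, which is the typical situation for GNNs.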

Comparison with IFT Models
From this abstract perspective, a standardized, generative model $\bar{d}(\zeta)$ in IFT is structurally a GNN, as both consist of sequences of local non-linear and non-local linear operations on their input vector $\zeta = (\xi, \eta)$. The concrete architecture of an IFT model and a typical GNN might differ significantly, as GNNs often map a lower-dimensional latent space into a higher-dimensional data or feature space, whereas the dimension of the IFT model latent space can be very high, as it contains a subset of the virtually infinitely many degrees of freedom of a field, see Figure 1.
Additionally, the way IFT-based models and GNNs are usually used differs a bit. Both can be used to generate synthetic samples of outputs by processing random latent space vectors $\xi \leftarrow \mathcal{G}(\xi, 1)$. However, typically an IFT model $\bar{d}(\zeta)$ is applied to infer all latent space variables in $\zeta$ from data $d$. From the latent variables, the signal of interest can always be recovered via $s(\zeta)$.
For this inference, the so-called information Hamiltonian, potential, or energy $H(d, \zeta) := -\ln P(d, \zeta) = H(d|\zeta) + H(\zeta)$ is investigated with respect to $\zeta$, where $H(a|b) := -\ln P(a|b)$. This quantity is introduced to IFT in analogy to statistical mechanics. It summarizes the full knowledge on the problem (as it is just a logarithmic coordinate transformation in the space of probabilities) and has the nice property that it allows us to speak about information as an additive quantity, as $H(a, b) = H(a|b) + H(b)$. Investigating the relevant information Hamiltonian $H(d, \zeta)$ for our IFT problem can be done, for example, by minimizing it to obtain a MAP estimator for $\zeta$ or, as discussed in the next section, via variational inference (VI). In case of constant, signal-independent Gaussian white noise statistics with variance $\sigma_n^2$, the information Hamiltonian becomes $H(d, \zeta) = \frac{1}{2\sigma_n^2} |d - \bar{d}(\zeta)|^2 + \frac{1}{2} \zeta^\dagger \zeta + \mathrm{const}$. The training of a usual GNN is done with a training data set $\tilde{d} = (d_i)_i$, for which a corresponding latent space vector set $\tilde{\xi} = (\xi_i)_i$ and common network parameters $\eta$ need to be found. For this, a loss function of the form $\tilde{H}(\tilde{d}, \tilde{\xi}, \eta) = \sum_i [\tilde{H}(d_i|\xi_i, \eta) + \frac{1}{2} \xi_i^\dagger \xi_i] + \tilde{H}(\eta)$ might be minimized. Here, a typical GNN data loss function $\tilde{H}(d_i|\xi_i, \eta)$ as used for the decoder part of an autoencoder (AE) [49] was assumed. In a generative adversarial network (GAN) [50], however, this data loss function is given in terms of the output of a discriminator network. The network parameter prior term $\tilde{H}(\eta)$ might be chosen to be uninformative ($\tilde{H}(\eta) = \mathrm{const}$) or informative (e.g., $\tilde{H}(\eta) = \frac{1}{2} \eta^\dagger \eta$ in case of a Gaussian prior on the parameters).
Anyhow, by comparison of Equations (45)-(47) with Equations (43) and (44), it should be apparent that the network loss functions can be structurally similar to the IFT information Hamiltonian. Both consist of a standardized quadratic prior energy and a likelihood energy, and both can have a probabilistic interpretation in terms of being negative log-probabilities, e.g., $P(d, \xi, \eta) = e^{-H(d, \xi, \eta)}$ and $P(\tilde{d}, \tilde{\xi}, \eta) = e^{-\tilde{H}(\tilde{d}, \tilde{\xi}, \eta)}$, respectively. For this reason, we do not distinguish between an information Hamiltonian $H$ and a network loss function $\tilde{H}$ and write $H$ for both in the following. An IFT-GNN can operate with just a single data vector $d$ due to the domain knowledge coded into its architecture, whereas usual GNNs require sets of data vectors $\tilde{d} = (d_i)_i$ to be trained. Recently, more IFT-like architectures for GNNs were proposed as well, which are also able to process data without training [51].
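For the case of signal-independent Gaussian white noise, the structure "quadratic data misfit plus standardized quadratic prior energy" can be sketched as follows. The function name, the noise standard deviation `sigma_n`, and the toy linear response `R` are assumptions of this illustration. The linear model is chosen only because its MAP solution is then known in closed form.

```python
import numpy as np

def information_hamiltonian(zeta, d, model, sigma_n):
    """H(d, zeta) up to a constant: data misfit plus standardized prior energy."""
    residual = d - model(zeta)
    return 0.5 * residual @ residual / sigma_n**2 + 0.5 * zeta @ zeta

rng = np.random.default_rng(3)
R = rng.standard_normal((5, 3))          # toy linear response, d_bar = R zeta
sigma_n = 0.5
zeta_true = rng.standard_normal(3)
d = R @ zeta_true + sigma_n * rng.standard_normal(5)

# For a linear model the minimizer (MAP solution) is available in closed form.
zeta_map = np.linalg.solve(np.eye(3) + R.T @ R / sigma_n**2, R.T @ d / sigma_n**2)
h_map = information_hamiltonian(zeta_map, d, lambda z: R @ z, sigma_n)
```

For a non-linear `model`, the same function could be handed to a generic numerical minimizer. The same expression also serves as a decoder-style loss when `model` is a neural network.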

Basic Idea
So far, it has been assumed here that MAP estimators are used to determine the parameters $\zeta$ for both IFT-based models and traditional GNNs. MAP estimators are known to be prone to over-fitting the data, as they do not probe the phase-space volumes adjacent to their solutions. VI methods perform better in that respect, while still being affordable in terms of computational costs for the high-dimensional settings of IFT-based field inference and traditional GNN training. They were used in most recent IFT applications [12][13][14][15][16][17][18,21,52] and are prominently present in the name of variational autoencoders (VAEs) [45], which are built on VI.
In VI, the posterior $P(\zeta|d)$ is approximated by a simpler probability distribution $Q(\zeta|d')$, in many applications by a Gaussian $Q(\zeta|d') = \mathcal{G}(\zeta - \theta, \Theta)$, where $d' = (\theta, \Theta)$. The Gaussian is chosen to minimize the variational Kullback-Leibler (KL) divergence $\mathrm{KL}_\zeta(d', d) := D_{\mathrm{KL}}(Q \| P) = \int d\zeta\, Q(\zeta|d') \ln [Q(\zeta|d') / P(\zeta|d)]$ with respect to the parameters of $d'$, $\theta$ and $\Theta$ in our case. Ideally, all degrees of freedom (DoF) of $\theta$ and $\Theta$ are optimized. In practice, however, this is often not feasible due to the quadratic scaling of the number of DoF of $\Theta$ with that of $\theta$. Three approximate schemes for handling the high-dimensional uncertainty covariance will be discussed in the following, leading to the ADVI, MGVI, and geoVI techniques introduced below, namely
• mean field theory, in which $\Theta$ is assumed to be diagonal, as used by ADVI;
• the usage of the Fisher information to approximate $\Theta$ as a function of $\theta$, thereby effectively removing the DoF of $\Theta$ from the optimization problem, as used by MGVI;
• a coordinate transformation of the latent space that approximately standardizes the posterior and therefore sets the covariance to the identity matrix in the new coordinates, as performed by geoVI.
Before these are discussed, a note that applies to all of them is in order. Optimizing the VI KL, Equation (51), is slightly sub-optimal from an information theoretical point of view, as this minimizes the amount of information introduced by going from $P$ to $Q$. The expectation propagation (EP) KL with reversed arguments, $D_{\mathrm{KL}}(P \| Q)$, would be better, as it minimizes the information loss from approximating $P$ with $Q$ [53]. VI is known to underestimate the uncertainties, whereas EP conservatively overestimates them. However, calculating the EP solution for $\theta$ and $\Theta$ would require integrating over the posterior. If this were feasible, any posterior quantity of interest could be calculated as well, and there would be no need to approximate $P(\zeta|d)$ in the first place. Estimating and minimizing the VI KL $D_{\mathrm{KL}}(Q \| P)$ is less demanding, as the integral over the simpler (Gaussian) distribution $Q$ can very often be performed analytically or by sample averaging using samples drawn from $Q$.
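The sample-averaging route can be sketched as follows, using the identity $D_{\mathrm{KL}}(Q \| P) = \langle H(d, \zeta) \rangle_Q - S[Q] - H(d)$ with the Gaussian entropy $S[Q] = \frac{1}{2} \ln |2\pi e\, \Theta|$. The function names and the two-dimensional toy target are assumptions for illustration. The toy Hamiltonian is normalized such that the constant $H(d)$ vanishes, so that the estimate can be compared with known values.

```python
import numpy as np

def vi_kl_estimate(theta, theta_cov, hamiltonian, n_samples, rng):
    """Estimate <H(d, zeta)>_Q - S[Q], i.e. D_KL(Q||P) up to the constant H(d)."""
    chol = np.linalg.cholesky(theta_cov)
    # Samples from Q = G(zeta - theta, theta_cov) via reparametrized white noise.
    samples = theta + rng.standard_normal((n_samples, theta.size)) @ chol.T
    mean_energy = np.mean([hamiltonian(z) for z in samples])
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * theta_cov)
    entropy = 0.5 * logdet
    return mean_energy - entropy

def toy_hamiltonian(zeta):
    """Normalized energy of a 2D standard normal 'posterior' (so that H(d) = 0)."""
    return 0.5 * zeta @ zeta + np.log(2.0 * np.pi)

rng = np.random.default_rng(4)
kl_matched = vi_kl_estimate(np.zeros(2), np.eye(2), toy_hamiltonian, 20000, rng)
kl_shifted = vi_kl_estimate(np.array([1.0, 0.0]), np.eye(2), toy_hamiltonian, 20000, rng)
```

In this toy case the exact values are 0 for the matched approximation and 0.5 for the approximation shifted by one standard deviation. The estimates agree with these up to sampling noise.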

ADVI and Mean Field Approximation
In all VI techniques discussed here, the posterior mean $\theta$ and the posterior uncertainty covariance $\Theta$ become parameters to be determined. The vector $\theta$ has the dimension $N_{\mathrm{dim}}$ of the latent space, whereas the symmetric posterior uncertainty covariance $\Theta$ has $N_{\mathrm{dim}} (N_{\mathrm{dim}} + 1)/2 = O(N_{\mathrm{dim}}^2)$ independent DoF. For small problems, these might be solved for; for large problems with millions of DoF, however, they cannot even be stored in computer memory. To circumvent this, the Automatic Differentiation Variational Inference (ADVI) algorithm [54] often invokes the so-called mean field approximation (MFA). This assumes a diagonal covariance $\Theta_{\mathrm{MFA}} = \hat{\theta'} \equiv \mathrm{diag}(\theta')$, with $\theta'$ being a latent space vector. Cross-correlations between parameters cannot be represented by this, which is problematic in particular in combination with the tendency of VI to underestimate uncertainties.

MGVI and Fisher Information Metric
In order to overcome this limitation of ADVI, which restricts its usage in IFT contexts with their large number of DoF, the Metric Gaussian Variational Inference (MGVI) [30] algorithm approximates the posterior uncertainty of $\zeta$ with the help of the Fisher information metric $M(\zeta) := \langle \frac{\partial H(d|\zeta)}{\partial \zeta} \frac{\partial H(d|\zeta)}{\partial \zeta^\dagger} \rangle_{(d|\zeta)}$. The starting point for obtaining the uncertainty covariance $\Theta$ used in MGVI is the Hessian of the information Hamiltonian, $\partial^2 H(d, \zeta) / \partial\zeta\, \partial\zeta^\dagger$ (the negative Hessian of the log-posterior), as a first guess for the approximate posterior precision matrix $\Theta^{-1}$. Using this Hessian evaluated at the minimum $\zeta_{\mathrm{MAP}}$ of the information Hamiltonian $H(d, \zeta)$ would correspond to the Laplace approximation, in which the posterior is replaced by a Gaussian obtained from a saddle point approximation at its maximum. However, neither is the MAP solution ideal, as discussed above, nor would this be a good approximation at many locations $\zeta$ that differ from $\zeta_{\mathrm{MAP}}$. This is because positive definiteness of the Hessian is not guaranteed there, although it is an essential property of any covariance and precision matrix. For this reason, $\Theta^{-1}$ cannot directly be approximated by this Hessian.
It turns out that the likelihood-averaged Hessian is strictly positive definite and is therefore a candidate for an approximate posterior precision matrix for any guessed posterior mean $\theta$. A short calculation shows that the likelihood-averaged Hessian is indeed positive definite: $\Theta^{-1}(\theta) := \langle \partial^2 H(d, \zeta) / \partial\zeta\, \partial\zeta^\dagger \rangle_{(d|\zeta=\theta)} = \langle \partial^2 H(d|\zeta) / \partial\zeta\, \partial\zeta^\dagger \rangle_{(d|\zeta=\theta)} + 1 = M(\theta) + 1 > 0$. The last step follows because the Fisher metric $M(\theta)$ is an average over outer products ($v v^\dagger \geq 0$) of likelihood Hamiltonian gradient vectors $v = \partial H(d|\zeta)/\partial \zeta$ and thereby positive semi-definite. Adding $1 > 0$ to the Fisher metric turns the approximate precision matrix into a positive definite matrix $\Theta^{-1}(\theta) > 0$, of which the inverse $\Theta(\theta)$ exists for all $\theta$ and is positive definite as well.
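A small numerical sketch may make the positive definiteness concrete. It assumes a Gaussian likelihood with signal-independent white noise of standard deviation `sigma_n`, for which the Fisher metric is $M = J^\dagger N^{-1} J$ with $J$ the Jacobian of the response. The helper names, the finite-difference Jacobian, and the toy exponential response are illustrative assumptions.

```python
import numpy as np

def jacobian_fd(model, zeta, eps=1e-6):
    """Central finite-difference Jacobian of model at zeta (columns = latent directions)."""
    cols = []
    for i in range(zeta.size):
        step = np.zeros_like(zeta)
        step[i] = eps
        cols.append((model(zeta + step) - model(zeta - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def mgvi_precision(model, theta, sigma_n):
    """Approximate precision 1 + M(theta) for Gaussian white noise of std sigma_n."""
    J = jacobian_fd(model, theta)
    fisher = J.T @ J / sigma_n**2      # Fisher metric of a Gaussian likelihood
    return np.eye(theta.size) + fisher

model = lambda z: np.exp(z)            # toy point-wise non-linear response
theta = np.array([0.0, 1.0, -2.0])
precision = mgvi_precision(model, theta, sigma_n=0.5)
eigenvalues = np.linalg.eigvalsh(precision)   # all bounded below by 1
```

Because the Fisher term is a Gram matrix, every eigenvalue of the approximate precision is at least one, whatever value of `theta` is inserted.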

Exact Uncertainty Covariance
Being positive definite is of course not the only property an approximation of the posterior uncertainty covariance has to fulfill. It also has to approximate well. Fortunately, this seems to be the case in many situations. The likelihood averaged Laplace approximation actually becomes the exact posterior uncertainty in case of linear Gaussian measurement problems as is shown in the following. If it is exact in such linear situations, it should be a valid approximation in the vicinity of any linear case.
For linear measurement problems, the measurement equation is of the form $d = R\zeta + n$, the noise statistics are $P(n|\zeta) = \mathcal{G}(n, N)$, and the standardized prior is $P(\zeta) = \mathcal{G}(\zeta, 1)$. The corresponding posterior is known to be a Gaussian with mean $m$ and covariance $D$ given by the generalized Wiener filter solution $m = D R^\dagger N^{-1} d$ and the Wiener covariance $D = (1 + R^\dagger N^{-1} R)^{-1}$, respectively (e.g., [1]). In this case, the Fisher information metric $M = R^\dagger N^{-1} R$ is independent of $\zeta$. The approximate posterior uncertainty covariance as given by Equation (54) equals the exact posterior covariance, $\Theta = (1 + M)^{-1} = (1 + R^\dagger N^{-1} R)^{-1} = D$. Thus indeed, the adopted approximation becomes exact in this situation. This shows why this approximation can yield sensible results in sufficiently well-behaved cases, in particular when a linearization of the inference problem around a reference solution (e.g., a MAP estimate) is already a good approximation. Furthermore, for all signal space directions around this reference point that are unconstrained by the data, this covariance approximation returns the prior uncertainty, as it should. Additional discussion of this approximation can be found in Knollmüller and Enßlin [30], where its performance with respect to ADVI is also numerically investigated.
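The exactness in the linear Gaussian case can be checked numerically on a small toy problem. The sketch below compares the Wiener filter expressions with an independent computation that conditions the joint Gaussian of $(\zeta, d)$ on the data. The dimensions and the random response and noise covariance are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_lat, n_data = 4, 6
R = rng.standard_normal((n_data, n_lat))       # toy linear response
N = np.diag(rng.uniform(0.5, 2.0, n_data))     # toy noise covariance
d = rng.standard_normal(n_data)                # arbitrary data vector

N_inv = np.linalg.inv(N)
M = R.T @ N_inv @ R                            # Fisher metric, independent of zeta
D_wiener = np.linalg.inv(np.eye(n_lat) + M)    # Theta = (1 + M)^-1
m_wiener = D_wiener @ R.T @ N_inv @ d          # Wiener filter mean

# Independent check: condition the joint Gaussian of (zeta, d) on d.
data_cov = R @ R.T + N                         # covariance of d for a unit prior
gain = R.T @ np.linalg.inv(data_cov)
D_conditional = np.eye(n_lat) - gain @ R
m_conditional = gain @ d
```

Both routes give the same mean and covariance, in line with the statement that the Fisher-metric-based covariance is the exact posterior covariance for linear Gaussian measurement problems.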
The important point about this approximate uncertainty covariance $\Theta$ is that it is a function of the latent space mean estimate $\theta$, i.e., $\Theta(\theta)$, and therefore does not need to be inferred as well. For many likelihoods, the Fisher metric is available analytically, alleviating the need to store $\Theta$ in computer memory as an explicit matrix. It is only necessary that certain operations can be performed with $\Theta$, like applying it to a vector or drawing samples from a Gaussian with this covariance. Relying solely on those memory-inexpensive operations, the MGVI algorithm is able to minimize the relevant VI KL, namely $\mathrm{KL}_\zeta((\theta, \Theta(\theta)), d) = D_{\mathrm{KL}}(Q \| P)$, with respect to the approximate posterior mean $\theta$. The results of MGVI are then the posterior mean $\theta$, the uncertainty covariance $\Theta(\theta)$, and posterior samples $\{\zeta_i\}_i$ drawn according to this mean and covariance. These samples can then be propagated into posterior signal samples $s_i = s(\zeta_i)$, from which any desired posterior signal statistics can be calculated.
MGVI has enabled field inference for problems that are too complex to be solved by MAP, in particular when multiple layers of hyperpriors are involved (e.g., [13,15]).
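How a covariance of the form $\Theta = (1 + M)^{-1}$ can be applied and sampled from without ever being stored may be sketched as follows. This is a toy illustration under assumptions, not the MGVI implementation: a linear response with white noise is assumed, the hand-written conjugate gradient solver and all names are hypothetical, and the response is a small explicit matrix only for brevity (only its action on vectors is used).

```python
import numpy as np

def conjugate_gradient(apply_op, b, tol=1e-12, max_iter=1000):
    """Solve apply_op(x) = b for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - apply_op(x) for x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = apply_op(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(6)
R = rng.standard_normal((8, 5))       # toy linear(ized) response
sigma_n = 0.3

def apply_precision(v):
    """Apply 1 + R^T N^-1 R to v using only matrix-vector products."""
    return v + R.T @ (R @ v) / sigma_n**2

def apply_covariance(v):
    """Apply Theta = (1 + M)^-1 to v via conjugate gradients."""
    return conjugate_gradient(apply_precision, v)

def draw_sample():
    """Draw from G(zeta, Theta): Theta applied to a vector with covariance 1 + M."""
    xi = rng.standard_normal(R.shape[1])
    n = sigma_n * rng.standard_normal(R.shape[0])
    return apply_covariance(xi + R.T @ n / sigma_n**2)

v = rng.standard_normal(5)
theta_v = apply_covariance(v)
sample = draw_sample()
```

The sampling step works because the vector $\xi + R^\dagger N^{-1} n$ has covariance $1 + M$, so applying $\Theta$ to it yields a vector with covariance $\Theta (1 + M) \Theta = \Theta$.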

Geometric Variational Inference
ADVI's and MGVI's weak point, however, can be the Gaussian approximation of the posterior, which might be strongly non-Gaussian in certain applications. In order to overcome this, the geometric variational inference (geoVI) algorithm [32] was introduced as an extension of MGVI. geoVI puts another coordinate transformation on top of the one used by MGVI, so that $\zeta = g_0(y)$, with $g_0$ to be performed before any of the other IFT-GNN operations $g_1, \ldots, g_n$, approximately standardizes the posterior, $P(y|d) \approx \mathcal{G}(y, 1)$. Remarkably, this transformation can be constructed without the (prohibitive) usage of any explicit matrix or higher-order tensor in the latent space, thus also allowing us to tackle very high-dimensional inference problems, as with MGVI. The transformation is basically a normalizing flow (network) [55], with the difference from their usual usage in ML that the geoVI flow does not need to be trained, but is derived in an automated fashion from the problem statement in the form of its information Hamiltonian. Specifically, the coordinate transformation $g_0$ is defined to solve a constraining equation on its Jacobian $\partial g_0 / \partial y$, evaluated at $\zeta = g_0(y)$, which fully specifies $g_0$ up to an integration constant $\theta$. This remaining constant is solved for by minimizing the VI KL with respect to $\theta$ to retrieve the optimal geoVI approximation. With geoVI, deeper hierarchical models, which more often exhibit non-Gaussian posteriors due to a larger number of degenerate parameters in them, can be approached via VI. The ability of geoVI to provide uncertainty information is illustrated in Figure 2 (bottom middle and right panels) and in Figure 3. Further details on geoVI and detailed comparisons of ADVI, MGVI, geoVI, and Hamiltonian Monte Carlo methods can be found in [32].

Conclusions and Outlook
This paper argues that IFT techniques can well be regarded as ML and AI methods by showing their interrelation with GNNs, normalizing flows, and VI techniques. This insight is not necessarily new, as this paper just summarizes a number of recent works [29][30][31][32] that suggested this before.
First, the generative models built and used in IFT are GNNs that can interpret data without initial training, thanks to the domain knowledge coded into their architecture [29]. Related architectures have very recently been proposed as image priors in the context of neural network architectures as well [51]. As IFT models and the newly proposed image priors do not obtain their intelligence from data-driven learning, they are strictly speaking not ML techniques, but might be characterized as (expert) knowledge-driven AI systems. From a technical point of view, however, such a distinction could be seen as splitting hairs.
Second, the VI algorithms used in IFT and AI to approximately infer quantities are a natural interface between these areas. Here, the related ADVI [54], MGVI [30], and geoVI [32] algorithms were briefly discussed, which can be used in classical ML and AI as well as in IFT applications.
And third, the common probabilistic formulation of IFT models and GNNs, as well as the common VI infrastructure of the two areas allows for combining pre-trained GNNs and other networks with IFT-style model components. In that respect, the possibility to perform Bayesian reasoning with trained neural networks as described in [31] might give an outlook on the potential to combine IFT with other ML and AI methods.