A Measure of Information Available for Inference

The mutual information between the state of a neural network and the state of the external world represents the amount of information stored in the neural network that is associated with the external world. In contrast, the surprise of the sensory input indicates the unpredictability of the current input; that is, it is a measure of inference ability, and an upper bound of the surprise is known as the variational free energy. According to the free-energy principle (FEP), a neural network continuously minimizes the free energy to perceive the external world. For the survival of animals, inference ability is considered to be more important than simply memorized information. In this study, the free energy is shown to represent the gap between the amount of information stored in the neural network and that available for inference. This concept involves both the FEP and the infomax principle, and could be a useful measure for quantifying the amount of information available for inference.


Introduction
Sensory perception comprises complex responses of the brain to sensory inputs. For example, the visual cortex can distinguish objects from their background [1], while the auditory cortex can recognize a certain sound in a noisy place with high sensitivity, a phenomenon known as the cocktail party effect [2][3][4][5][6][7]. The brain (i.e., a neural network) has acquired these perceptual abilities without supervision, which is referred to as unsupervised learning [8][9][10]. Unsupervised learning, or implicit learning, is defined as the learning that happens in the absence of a teacher or supervisor; it is achieved through adaptation to past environments, which is necessary for higher brain functions. An understanding of the physiological mechanisms that mediate unsupervised learning is fundamental to augmenting our knowledge of information processing in the brain.
One of the consequent benefits of unsupervised learning is inference, which is the action of guessing unknown matters based on known facts or certain observations, i.e., the process of drawing conclusions through reasoning and estimation. While inference is thought to be an act of the conscious mind in the ordinary sense of the word, it can occur even in the unconscious mind. Hermann von Helmholtz, a 19th-century physicist/physiologist, realized that perception often requires inference by the unconscious mind and coined the word unconscious inference [11]. According to Helmholtz, conscious inference and unconscious inference can be distinguished based on whether conscious knowledge is involved in the process. For example, when an astronomer computes the positions or distances of stars in space based on images taken at various times from different parts of the orbit of the Earth, he or she performs conscious inference because the process is "based on a conscious knowledge of the laws of optics"; by contrast, "in the ordinary acts of vision, this knowledge of optics is lacking" [11]. Thus, the latter process is performed by the unconscious mind. Unconscious inference is crucial for estimating the overall picture from partial observations. Let us suppose s ≡ (s_1, . . . , s_N)^T as hidden sources that follow p(s|λ) ≡ ∏_i p(s_i|λ), parameterized by a hyper-parameter set λ; x ≡ (x_1, . . . , x_M)^T as sensory inputs; u ≡ (u_1, . . . , u_N)^T as neural outputs; z ≡ (z_1, . . . , z_M)^T as background noises that follow p(z|λ) parameterized by λ; ε ≡ (ε_1, . . . , ε_M)^T as reconstruction errors; and f ∈ R^M, g ∈ R^N, and h ∈ R^M as nonlinear functions (see also Table 1). The generative process of the external world (or the environment) is described by a stochastic equation:

Generative process: x = f(s) + z. (1)

The recognition and generative models of the neural network are defined as follows:

Recognition model: u = g(x), (2)
Generative model: x = h(u) + ε. (3)

Figure 1 illustrates the structure of the system under consideration. For the generative model, the prior distribution of u is defined as p*(u|γ) = ∏_i p*(u_i|γ) with a hyper-parameter set γ, and the likelihood function as p*(x|h(u), γ) = N[x; h(u), Σ(γ)], where p* indicates a statistical model and N is a Gaussian distribution characterized by the mean h(u) and covariance Σ(γ). Moreover, suppose θ, W ∈ R^{N×M}, and V ∈ R^{M×N} as parameter sets for f, g, and h, respectively; λ as a hyper-parameter set for p(s|λ) and p(z|λ); and γ as a hyper-parameter set for p*(u|γ) and p*(x|h(u), γ). Here, hyper-parameters are defined as parameters that determine the shape of distributions (e.g., the covariance matrix). Note that W and V are assumed to be synaptic strength matrices for the feedforward and feedback paths, respectively, while γ is assumed to be a state of neuromodulators, similarly to [13][14][15]. In this study, unless specifically mentioned, parameters and hyper-parameters refer to slowly changing variables, so that W, V, and γ can change their values. Equations (1)-(3) are transformed into probabilistic representations.
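As a concrete illustration, the three equations can be instantiated with linear f, g, and h, as used later in the Simulation section; the dimensions, matrices, and noise level below are illustrative choices rather than values fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N_src, M_in = 2, 4                               # dim(s) and dim(x); illustrative

# Generative process (Equation (1)): x = f(s) + z, with linear f(s) = theta @ s
s = rng.laplace(0.0, 1.0 / np.sqrt(2.0), N_src)  # hidden sources with unit variance
theta = rng.standard_normal((M_in, N_src))       # mixing matrix (parameters of f)
z = 0.1 * rng.standard_normal(M_in)              # background noise
x = theta @ s + z                                # sensory input

# Recognition model (Equation (2)): u = g(x) = W x (feedforward path)
W = rng.standard_normal((N_src, M_in))
u = W @ x

# Generative model (Equation (3)): x = h(u) + eps, with linear h(u) = V u
V = rng.standard_normal((M_in, N_src))           # feedback path
eps = x - V @ u                                  # reconstruction error
```

Here u, eps, and the matrices play the roles of the neural outputs, reconstruction errors, and synaptic strengths defined above.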
[Table 1, continued: p(x) denotes the actual probability density of x; p(ϕ|x), p(x, ϕ), and p(ϕ) the actual probability densities (posterior densities); D_KL[p(·)||p*(·)] the KLD between p(·) and p*(·); I[x; ϕ] the mutual information between x and ϕ; and X[x; ϕ] the utilizable information between x and ϕ.]

Generative process: p(s, x|θ, λ) = p(x|s, θ, λ)p(s|λ), (4)
Recognition model: p(x, u|W) = p(x|u, W)p(u|W), (5)
Generative model: p*(x, u|V, γ) = p*(x|u, V, γ)p*(u|γ). (6)

Note that δ(·) is Dirac's delta function and p*(x|u, V, γ) ≡ p(x|u, V, γ, m) is a statistical model given a model structure m. For simplification, let ϑ ≡ {s, θ, λ} be the set of hidden states of the external world and ϕ ≡ {u, W, V, γ} be the set of internal states of the neural network. By multiplying Equation (4) by p(θ, λ), Equation (5) by p(W, V, γ), and Equation (6) by p*(W, V, γ) = p*(W)p*(V)p*(γ), Equations (4)-(6) become

Generative process: p(x, ϑ) = p(x|ϑ)p(ϑ) = p(z = x − f|ϑ)p(ϑ), (7)
Recognition model: p(x, ϕ) = p(x|ϕ)p(ϕ), (8)
Generative model: p*(x, ϕ) = p*(x|ϕ)p*(ϕ), (9)

where p*(ϕ) = p*(u|γ)p*(W, V, γ) is the prior distribution for ϕ and p*(x, ϕ) ≡ p(x, ϕ|m) is a statistical model given a model structure m, which is determined by the shapes of p*(ϕ) and p*(x|ϕ) ≡ p*(x|u, V, γ). The expression p*(x, ϕ) is used instead of p(x, ϕ|m) to emphasize the difference between p(x, ϕ) and p*(x, ϕ). While p(x, ϕ) ≡ p(u|x, W)p(W, V, γ|x)p(x) is the actual joint probability of (x, ϕ) (which corresponds to the posterior distribution), p*(x, ϕ), i.e., the product of the likelihood function and the prior distribution, represents the generative model that the neural network expects (x, ϕ) to follow. Typically, the elements of p*(W, V, γ) are supposed to be independent of each other. For example, sparse priors for parameters are sometimes used to prevent over-learning [37], while a generative model with sparse priors for outputs is known as a sparse coding model [38,39]. As shown later, inference and learning are achieved by minimizing the difference between p(x, ϕ) and p*(x, ϕ).
In this process, minimizing the difference between p(V, W, γ) and p*(V, W, γ) acts as a constraint or a regularizer that prevents over-learning (see Section 2.3 for details).

Information Stored in the Neural Network
Information is defined as the negative log of probability [40]. When Prob(x) is the probability of given sensory inputs x, its information is given by −log Prob(x) [nat], where 1 nat = 1.4427 bits. When x takes continuous values, by coarse graining, −log Prob(x) is replaced with −log(p(x)Δ_x), where p(x) is the probability density of x and Δ_x ≡ ∏_i Δ_{x_i} is the product of the finite spatial resolutions of the elements of x (Δ_{x_i} > 0). The expectation of −log(p(x)Δ_x) over p(x) gives the Shannon entropy (or average information), which is defined by

H[x] ≡ ⟨−log(p(x)Δ_x)⟩_{p(x)}, (10)

where ⟨·⟩_{p(x)} ≡ ∫ · p(x)dx represents the expectation of · over p(x). Note that the use of −log(p(x)Δ_x) instead of −log p(x) is useful because this H[x] is non-negative (dProb(x) = p(x)Δ_x takes a value between 0 and 1). This is a coarse binning of x, and the spatial resolution Δ_x takes a small but nonzero value, so that the addition of the constant −log Δ_x has no effect other than sliding the offset value. If and only if p(x) is Dirac's delta function (strictly, p(x) = 1/Δ_x at one bin and 0 otherwise), H[x] = 0 is realized. For the system under consideration (Equations (7)-(9)), the information shared between the external world states (x, ϑ) and the internal states of the neural network ϕ is defined by the mutual information [41]

I[x, ϑ; ϕ] ≡ ⟨log (p(x, ϑ, ϕ)/(p(x, ϑ)p(ϕ)))⟩_{p(x,ϑ,ϕ)}. (11)

Note that p(x, ϑ, ϕ) is the joint probability of (x, ϑ) and ϕ, and p(x, ϑ) and p(ϕ) are their respective marginal distributions. This mutual information takes a non-negative value and quantifies how strongly (x, ϑ) and ϕ are related to each other. High mutual information indicates that the internal states are informative for explaining the external world states, while zero mutual information means that they are independent of each other. However, the only information that the neural network can directly access is the sensory input. This is the case because the system under consideration can be described as a Bayesian network (see [42,43] for details on the Markov blanket).
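The coarse-grained entropy defined above can be estimated numerically with the histogram method (the same method used later in the Simulation section). A minimal sketch for a one-dimensional standard Gaussian, whose differential entropy is (1/2)log(2πe) ≈ 1.419 nat; the sample size and resolution are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)   # x ~ N(0, 1)

delta = 0.05                               # spatial resolution Delta_x
edges = np.arange(-6.0, 6.0 + delta, delta)
counts, _ = np.histogram(samples, bins=edges)
P = counts[counts > 0] / counts.sum()      # Prob(x) = p(x) * Delta_x per bin

# H[x] = <-log(p(x) Delta_x)> is non-negative and equals the
# differential entropy plus the constant offset -log(Delta_x).
H = -np.sum(P * np.log(P))

diff_entropy = H + np.log(delta)           # remove the -log(Delta_x) offset
print(diff_entropy)  # close to 0.5 * log(2 * pi * e) ≈ 1.4189
```

As the text notes, changing the resolution Δ_x only slides the offset of H[x]; the Δ_x-independent part is the differential entropy.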
Hence, the entropy of the external world states under a fixed sensory input gives information that the neural network cannot infer. Moreover, there is no feedback control from the neural network to the external world in this setup. Thus, under a fixed x, ϑ and ϕ are conditionally independent of each other. From p(ϑ, ϕ|x) = p(ϑ|x)p(ϕ|x), we can obtain

I[x, ϑ; ϕ] = I[x; ϕ] ≡ ⟨log (p(x, ϕ)/(p(x)p(ϕ)))⟩_{p(x,ϕ)}. (12)

Using the Shannon entropy, I[x; ϕ] becomes

I[x; ϕ] = H[x] − H[x|ϕ], (13)

where

H[x|ϕ] ≡ ⟨−log(p(x|ϕ)Δ_x)⟩_{p(x,ϕ)} (14)

is the conditional entropy of x given ϕ. The Kullback–Leibler divergence (KLD) between two densities is defined by D_KL[p(x)||p*(x)] ≡ ⟨log(p(x)/p*(x))⟩_{p(x)}.
The KLD takes a non-negative value and indicates the divergence between two distributions. The infomax principle states that "the network connections develop in such a way as to maximize the amount of information that is preserved when signals are transformed at each processing stage, subject to certain constraints" [29]; see also [30][31][32]. According to the infomax principle, the neural network is hypothesized to maximize I[x; ϕ] to perceive the external world. However, I[x; ϕ] does not fully capture the inference capability of a neural network. For example, if the neural outputs simply express the sensory input itself (u = x), I[x; ϕ] = H[x] is easily achieved, but this does not mean that the neural network can predict or reconstruct the input statistics. This issue is considered in the next section.

Free-Energy Principle
If one has a statistical model determined by a model structure m, the information calculated based on m is given by the negative log likelihood −log p(x|m), which is termed the surprise (i.e., the negative log marginal likelihood) of the sensory input and expresses the unpredictability of the sensory input for the individual. The neural network is considered to minimize the surprise of the sensory input, using its knowledge about the external world, to perceive the external world [13]. To infer whether an event is likely to happen based on past observations, a statistical (i.e., generative) model is necessary; otherwise, it is difficult to generalize sensory inputs [45]. Note that the surprise is computed by marginalizing over the generative model; hence, the neural network can reduce the surprise by optimizing its internal states, while the Shannon entropy of the input is determined by the environment. When the actual probability density and a generative model are given by p(x) and p*(x) ≡ p(x|m), respectively, the cross entropy ⟨−log(p*(x)Δ_x)⟩_{p(x)} is always larger than or equal to the Shannon entropy H[x] because of the non-negativity of the KLD. Hence, in this study, the input surprise is defined by

S(x) ≡ −log(p*(x)Δ_x), (15)

and its expectation over p(x) by

S ≡ ⟨S(x)⟩_{p(x)} = H[x] + D_KL[p(x)||p*(x)]. (16)

As H[x] is determined by the environment and is constant for the neural network, minimization of this S has the same meaning as minimization of ⟨−log(p*(x)Δ_x)⟩_{p(x)}.
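The inequality between cross entropy and Shannon entropy can be checked by Monte Carlo; the two Gaussian densities below are illustrative stand-ins for the actual density p(x) and a mismatched model p*(x):

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log density of a univariate Gaussian N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)      # samples from the actual density p = N(0, 1)

neg_logp = -gauss_logpdf(x, 0.0, 1.0)   # -log p(x)
neg_logq = -gauss_logpdf(x, 1.0, 2.0)   # -log p*(x), model p* = N(1, 2^2)

H = neg_logp.mean()                     # Shannon entropy of p (up to the -log Delta_x offset)
S = neg_logq.mean()                     # surprise expectation under the model
kld = S - H                             # Monte Carlo estimate of D_KL[p || p*]

assert S >= H                           # cross entropy >= entropy (non-negative KLD)
# Analytic Gaussian KLD: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
kld_exact = np.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5
assert abs(kld - kld_exact) < 0.01
```

Because the same −log Δ_x offset appears in both terms, it cancels in the surprise-minus-entropy gap, which is exactly D_KL[p(x)||p*(x)].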
As the sensory input is generated by the external world generative process, consideration of the structure and dynamics placed in the background of the sensory input can provide accurate inference. According to the internal model hypothesis, animals develop an internal model in their brain to increase the accuracy and efficiency of inference [12][13][14][15][17][18][19]; thus, the internal states of the neural network ϕ are hypothesized to imitate the hidden states of the external world ϑ. A problem is that −log p*(x) = −log(∫ p*(x, ϕ)dϕ) is intractable for the neural network because of the integral of p*(x, ϕ) inside the logarithm. The FEP hypothesizes that the neural network calculates an upper bound of −log p*(x), which is more tractable, instead of the exact value as a proxy [13] (because −log p(x) is fixed, the free energy is sometimes defined including or excluding this term). This upper bound is termed the variational free energy:

F(x) ≡ ⟨−log(p*(x, ϕ)Δ_x) + log p(ϕ|x)⟩_{p(ϕ|x)} = S(x) + D_KL[p(ϕ|x)||p*(ϕ|x)]. (17)

Note that p(ϕ|x) ≡ p(u|x, W)p(W, V, γ|x) expresses the belief about hidden states of the external world encoded by internal states of the neural network, termed the recognition density. Due to the non-negativity of the KLD, F(x) is guaranteed to be an upper bound of S(x), and F(x) = S(x) holds if and only if p*(ϕ|x) = p(ϕ|x). Furthermore, the expectation of F(x) over p(x) is defined by

F ≡ ⟨F(x)⟩_{p(x)} = ⟨−log(p*(x|ϕ)Δ_x)⟩_{p(x,ϕ)} + ⟨−log(p*(ϕ)Δ_ϕ)⟩_{p(ϕ)} − H[ϕ|x] = D_KL[p(x, ϕ)||p*(x, ϕ)] + H[x], (18)

where ⟨−log(p*(x|ϕ)Δ_x)⟩_{p(x,ϕ)} is the negative log likelihood, called the accuracy [15]. The second and third terms are the cross entropy of ϕ and the conditional entropy of ϕ given x, and the difference between them is called the complexity [15]. The last term H[x] is a constant. F indicates the difference between the actual probability p(x, ϕ) and the generative model p*(x, ϕ). Given the non-negativity of the KLD, F − H[x] is always larger than or equal to the non-negative value S − H[x] = D_KL[p(x)||p*(x)], and F − H[x] = S − H[x] = 0 holds if and only if p*(x, ϕ) = p(x, ϕ).
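The bound F(x) ≥ S(x) can be verified directly in a small discrete example, where the marginalization inside the logarithm is tractable; the 2 × 3 joint table below is an arbitrary illustrative generative model:

```python
import numpy as np

# Toy discrete model: hidden state phi in {0, 1, 2}, observation x in {0, 1}.
rng = np.random.default_rng(0)
joint = rng.random((2, 3))
joint /= joint.sum()                    # generative model p*(x, phi)

x = 0
marginal = joint[x].sum()               # p*(x), the marginal likelihood
S = -np.log(marginal)                   # surprise of observing x

posterior = joint[x] / marginal         # exact posterior p*(phi | x)

def free_energy(q):
    """F(x) = sum_phi q(phi) log( q(phi) / p*(x, phi) ) for recognition density q."""
    return np.sum(q * (np.log(q) - np.log(joint[x])))

q_bad = np.array([0.5, 0.25, 0.25])     # arbitrary recognition density
assert free_energy(q_bad) >= S          # F is an upper bound of S ...
assert np.isclose(free_energy(posterior), S)  # ... and is tight iff q equals the posterior
```

The gap F(x) − S(x) is exactly the KLD between the recognition density and the true posterior, which is why minimizing F drives the recognition density toward the posterior.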
The FEP hypothesizes that F is minimized by optimizing neural activities (u), synaptic strengths (W and V; i.e., synaptic plasticity), and activities of neuromodulators (γ). The accuracy ⟨−log(p*(x|ϕ)Δ_x)⟩_{p(x,ϕ)} quantifies the amplitude of the reconstruction error. Minimization of this term is the maximum likelihood estimation [10] and provides a solution that (at least locally) minimizes the reconstruction error. In contrast, minimization of the complexity ⟨−log(p*(ϕ)Δ_ϕ)⟩_{p(ϕ)} − H[ϕ|x] makes p(ϕ) closer to p*(ϕ). As p*(ϕ) = p*(u|γ)p*(W, V, γ) usually supposes that the elements of ϕ are mutually independent, this acts as a maximization of entropy under a constraint. Hence, it increases the independence between internal states, which helps neurons establish an efficient representation, as pointed out by Jaynes' maximum entropy principle [46,47]. This is essential for blind source separation (BSS) [33][34][35][36] because the optimal parameters that minimize the accuracy are not always uniquely determined; consequently, maximum likelihood estimation alone does not always identify the generative process behind the sensory inputs. As F is the sum of the costs for maximum likelihood estimation and BSS, free-energy minimization is a rule that simultaneously minimizes the reconstruction error and maximizes the independence of the internal states. It is recognized that animals perform BSS [2][3][4][5][6][7]. Interestingly, even in vitro neural networks perform BSS, accompanied by a significant reduction of free energy in accordance with the FEP and Jaynes' maximum entropy principle [48].

Information Available for Inference
We now consider how the free energy expectation F relates to the mutual information I[x; ϕ]. According to unconscious inference and the internal model hypothesis, the aim of a neural network is to predict x, and for this purpose, it infers the hidden states of the external world. While the neural network is conventionally hypothesized to express sufficient statistics of the hidden states of the external world [14], here it is hypothesized that the internal states of the neural network are random variables whose probability distribution imitates the probability distribution of the hidden states of the external world. The neural network hence attempts to match the joint probability of the sensory inputs and the internal states with that of the sensory inputs and the hidden states of the external world. To do so, the neural network shifts the actual probability of the internal states p(x, ϕ) = p(x|ϕ)p(ϕ) closer to the generative model p*(x, ϕ) = p*(x|ϕ)p*(ϕ) that the neural network expects (x, ϕ) to follow (note that here, p(x|ϕ) = p(x|u, W) and p*(x|ϕ) = p*(x|u, V, γ)). This means that the shape or structure of p*(x, ϕ) is pre-defined, but the argument (x, ϕ) can still change. From this viewpoint, the difference between these two distributions is associated with a loss of information.
The amount of information available for inference can be calculated using the following three quantities related to information loss: (i) because H[x] is the information of the sensory input and I[x; ϕ] is the information stored in the neural network, their difference H[x] − I[x; ϕ] = H[x|ϕ] indicates the information loss in the recognition model (Figure 2); (ii) the difference between the actual and desired (prior) distributions of the internal states, D_KL[p(ϕ)||p*(ϕ)], quantifies the information loss for inferring internal states using the prior (i.e., blind source separation); this is a common approach used in BSS methods [33][34][35][36]; and (iii) the difference between the distributions of the actual reconstruction error and the reconstruction error under the given model, ⟨D_KL[p(x|ϕ)||p*(x|ϕ)]⟩_{p(ϕ)}, quantifies the information loss for representing inputs using internal states. Therefore, by subtracting these three values from H[x], a mutual-information-like measure representing the inference capability is obtained:

X[x; ϕ] ≡ H[x] − H[x|ϕ] − D_KL[p(ϕ)||p*(ϕ)] − ⟨D_KL[p(x|ϕ)||p*(x|ϕ)]⟩_{p(ϕ)}, (19)

which is called the utilizable information in this study. This utilizable information X[x; ϕ] is equivalently defined by replacing p(x, ϕ) in I[x; ϕ] with p*(x, ϕ), i.e., X[x; ϕ] = ⟨log(p*(x, ϕ)/(p(x)p(ϕ)))⟩_{p(x,ϕ)} = I[x; ϕ] − D_KL[p(x, ϕ)||p*(x, ϕ)], immediately yielding

F = I[x; ϕ] − X[x; ϕ] + H[x]. (20)

Hence, as H[x] is a constant, F represents the gap between the amount of information stored in the neural network and the amount that is available for inference, which is equivalent to the information loss in the generative model. Note that the sum of the losses in the recognition and generative models is H[x|ϕ] + D_KL[p(x, ϕ)||p*(x, ϕ)] = H[x] − X[x; ϕ]. Furthermore, X[x; ϕ] is transformed into

X[x; ϕ] = H[x] − L_X − L_A, (21)

where

L_X ≡ ⟨−log(p*(x|ϕ)Δ_x)⟩_{p(x,ϕ)} (22)

is the so-called reconstruction error, which is similar to the reconstruction error for principal component analysis (PCA) [49], while

L_A ≡ D_KL[p(ϕ)||p*(ϕ)] (23)

is a generalization of Amari's cost function for independent component analysis (ICA) [50]. PCA is one of the most popular dimensionality reduction methods. It is used to remove background noise and extract important features from sensory inputs [49,51]. In contrast, ICA is a BSS method used to decompose a mixed set of sensory inputs into independent hidden sources [34,36,50,52,53].
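These identities can be verified exactly in a small discrete example; the 3 × 4 joint tables below are arbitrary illustrative densities, with the resolution factors Δ omitted (in the discrete case they only shift each term by a constant):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4)); p /= p.sum()       # actual joint p(x, phi)
q = rng.random((3, 4)); q /= q.sum()       # generative model p*(x, phi)

px, pphi = p.sum(axis=1), p.sum(axis=0)    # marginals of the actual density
qphi = q.sum(axis=0)                       # model prior p*(phi)
q_x_given_phi = q / qphi                   # model likelihood p*(x|phi)

H_x = -np.sum(px * np.log(px))                         # input entropy H[x]
I = np.sum(p * np.log(p / np.outer(px, pphi)))         # stored information I[x; phi]
kld_joint = np.sum(p * np.log(p / q))                  # D_KL[p(x,phi) || p*(x,phi)]
F = kld_joint + H_x                                    # free energy expectation
X = I - kld_joint                                      # utilizable information

L_X = -np.sum(p * np.log(q_x_given_phi))               # reconstruction-error term
L_A = np.sum(pphi * np.log(pphi / qphi))               # Amari-type (prior-mismatch) term

assert np.isclose(F, I - X + H_x)          # F is the stored/utilizable gap plus H[x]
assert np.isclose(X, H_x - L_X - L_A)      # decomposition into PCA- and ICA-like costs
assert X <= I                              # the gap is the loss in the generative model
```

Since H[x] is fixed by the environment, maximizing X is equivalent to jointly minimizing the two losses L_X and L_A.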
Theoreticians hypothesize that PCA- and ICA-like learning underlies BSS in the brain [3]. This kind of extraction of hidden representations is also an important problem in machine learning [54,55]. Equation (21) indicates that X[x; ϕ] consists of PCA- and ICA-like parts, i.e., maximization of X[x; ϕ] can perform both dimensionality reduction and BSS (Figure 2). Their relationship is discussed in the next section.

Comparison between the Free-Energy Principle and Related Theories
In this section, the FEP is compared with other theories. As described in the Methods, the aim of the infomax principle is to maximize the mutual information I[x; ϕ] (Equation (13)), whereas the aim of the FEP is to minimize the free energy expectation F (Equation (18)); maximization of the utilizable information X[x; ϕ] (Equation (19)) amounts to doing both simultaneously.

Infomax Principle
The generative process and the recognition and generative models defined in Equations (1)-(3) are assumed. For the sake of simplicity, let us suppose that W, V, and γ follow Dirac's delta functions; then, the goal of the infomax principle is simplified to maximization of the mutual information between the sensory inputs x and the neural outputs u:

I[x; u|W] = H[u|W] − H[u|x, W]. (24)

Here, W, V, and γ are still variables; W is optimized through learning, while V and γ do not directly contribute to I[x; u|W]. For the sake of simplicity, let us suppose dim(x) ≥ dim(u) and a linear recognition model u = g(x) = Wx with a full-rank matrix W. As H[u|x, W] = const. is usually assumed and u has an infinite range, I[x; u|W] = H[u|W] + const. monotonically increases as the variance of u increases. Thus, I[x; u|W] without any constraint is insufficient for deriving learning algorithms for PCA or ICA. To perform PCA and ICA based on the infomax principle, one may consider the mutual information between the sensory inputs and the nonlinearly transformed neural outputs ψ(u) = (ψ(u_1), . . . , ψ(u_N))^T with an injective nonlinear function ψ(·). This mutual information is given by:

I[x; ψ(u)|W] = H[ψ(u)|W] − H[ψ(u)|x, W]. (25)

When the nonlinear neural outputs have a finite range (e.g., between 0 and 1), the variance of u is maintained in an appropriate range. The infomax-based ICA [52,53] is formulated based on this constraint. From p(ψ(u)|W) = |∂u/∂ψ(u)| p(u|W), it follows that H[ψ(u)|W] = H[u|W] + ⟨∑_i log ψ′(u_i)⟩_{p(u|W)}. Since H[ψ(u)|x, W] = const. holds, Equation (25) becomes:

I[x; ψ(u)|W] = H[u|W] + ⟨∑_i log ψ′(u_i)⟩_{p(u|W)} + const. (26)

This is the cost function that is usually considered in studies on infomax-based ICA [52,53]. The following sections show that PCA and ICA are performed by the maximization of Equation (26) as well as by the FEP.

Principal Component Analysis
Both the infomax principle and the FEP yield a cost function of PCA. One of the most popular data compression methods, PCA is defined by minimization of the error when the inputs are reconstructed from the compressed representation (i.e., u in this study) [49]. It is known that PCA is derived from the infomax principle under a constraint on the internal states. Although maximization of the mutual information between x and u under an orthonormal constraint on W is usually considered [29], here let us consider another solution. Suppose dim(x) > dim(u), V = W^T, and log ψ′(u_i) = u_i²/2 + const. From Equation (26), we obtain

I[x; ψ(u)|W] = H[u|W] + (1/2)⟨‖u‖²⟩_{p(u|W)} + const. (27)

The first term of Equation (27) is maximized if WW^T = I holds (i.e., if W is an orthogonal matrix; here, a coarse graining with a finite resolution of W is supposed). To maximize the second term, the outputs u need to lie in the subspace spanned by the first to the N-th major principal components of x. Therefore, maximization of Equation (27) performs PCA.
Further, PCA is also derived by minimization of L_X (Equation (22)), under the assumption that the reconstruction error follows a Gaussian distribution p*(x|ϕ) = p*(x|u, W, V, γ) = N[x; W^T u, γ^{-1} I].
Here, γ > 0 is a scalar hyper-parameter that scales the precision of the reconstruction error. Hence, the cost function is given by:

L_X = (γ/2)⟨‖x − W^T u‖²⟩_{p(x)} + (M/2) log(2π/γ) − log Δ_x, (28)

with u = Wx. When γ is fixed, the derivative of Equation (28) with respect to W gives the update rule for the least-mean-square-error PCA [49]. As this cost function quantifies the magnitude of the reconstruction error, the algorithm that minimizes Equation (28) yields the low-dimensional compressed representation that minimizes the loss incurred in reconstructing the sensory inputs. This algorithm is the same as Oja's subspace rule [51], up to an additional term that does not essentially change its behavior (see, e.g., [56] for a comparison between them). The L_X here also has the same form as the cost function of an auto-encoder [54].
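A minimal sketch of this reconstruction-error minimization, using the batch form of Oja's subspace rule; the learning rate, dimensions, and covariance spectrum below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N_samp, M_in, N_out = 5000, 4, 2

# 4-D Gaussian inputs whose covariance has two dominant directions
eigvals = np.array([4.0, 3.0, 0.1, 0.05])
Q, _ = np.linalg.qr(rng.standard_normal((M_in, M_in)))
X = rng.standard_normal((N_samp, M_in)) * np.sqrt(eigvals) @ Q.T

# Batch Oja subspace rule: dW ∝ <u (x - W^T u)^T>, with u = W x
W = 0.1 * rng.standard_normal((N_out, M_in))
eta = 0.05
for _ in range(2000):
    U = X @ W.T                       # outputs u = W x for all samples
    E = X - U @ W                     # reconstruction errors x - W^T u
    W += eta * (U.T @ E) / N_samp

# The learned W spans the principal subspace, so the mean squared
# reconstruction error approaches the sum of the minor eigenvalues (0.15).
err = np.mean(np.sum((X - (X @ W.T) @ W) ** 2, axis=1))
print(err)  # ≈ 0.15
```

At convergence the rows of W become approximately orthonormal (WW^T ≈ I), consistent with the constraint appearing in the infomax derivation above.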
Moreover, when the priors of u, W, V, and γ are flat, ⟨−log p*(u|γ)⟩_{p(u,W)} and D_KL[p(W, V, γ)||p*(W, V, γ)] are constants with respect to u, W, V, and γ, because p(W, V, γ) is supposed to be a delta function. Hence, the free energy expectation (Equation (18)) becomes F = L_X + const., where const. is a constant with respect to u, W, and V. In this case, the optimization of W gives the minimum of F because u and V are determined by W while γ is fixed. Thus, under this condition, F is equivalent to the cost function of the least-mean-square-error PCA.

Independent Component Analysis
It is known that ICA yields an independent representation of the input data by maximizing the independence between the outputs [52,53]. Thus, ICA reduces redundancy and yields an efficient representation. When sensory inputs are generated from hidden sources, representing the hidden sources is usually the most efficient representation. Both the infomax principle and the FEP yield a cost function of ICA. Let us suppose that the sources s_1, . . . , s_N independently follow an identical distribution p_0(s_i|λ). The infomax-based ICA is derived from Equation (26) [52,53]. If ψ(u_i) is defined to satisfy ψ′(u_i) = p_0(u_i|γ), the negative mutual information −I[x; ψ(u)|W] becomes the KLD between the actual and prior distributions up to a constant term,

−I[x; ψ(u)|W] = D_KL[p(u|W)||p*(u|γ)] + const. ≡ L_A + const.

The L_A here is known as Amari's ICA cost function [50], which is a reduction of Equation (23). While both −I[x; ψ(u)|W] and L_A provide the same gradient descent rule, formulating I[x; ψ(u)|W] requires the nonlinearly transformed neural outputs ψ(u). By contrast, L_A straightforwardly expresses that ICA is performed by minimization of the KLD between p(u|W) and p*(u|γ) = p_0(u|γ). Indeed, if dim(u) = dim(x) = N, the background noise is small, and the priors of W, V, and γ are flat, we obtain F = ⟨D_KL[p(u|W)||p*(u|γ)]⟩_{p(W,V,γ)} = L_A. Therefore, ICA is a subset of the inference problem considered in the FEP, and the derivation from the FEP is simpler, although both the infomax principle and the FEP yield the same ICA algorithm.
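A minimal sketch of gradient descent on this KLD, using Amari's natural-gradient ICA rule ΔW ∝ (I − ⟨φ(u)u^T⟩)W; here φ(u) = tanh(u) is a common stand-in for −d log p_0(u)/du with super-Gaussian (e.g., Laplace) sources, and the dimensions and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 2, 20000

# Independent Laplace sources and a fixed mixing matrix
S = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (T, N))     # unit-variance sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # mixing (generative process)
X = S @ A.T                                          # sensory inputs x = A s

# Whiten the inputs first (a standard preprocessing step)
C = np.cov(X.T)
d, E = np.linalg.eigh(C)
Xw = X @ (E / np.sqrt(d)) @ E.T

# Natural-gradient ICA: dW ∝ (I - <phi(u) u^T>) W, with u = W x
W = np.eye(N)
eta = 0.1
for _ in range(300):
    U = Xw @ W.T
    grad = np.eye(N) - (np.tanh(U).T @ U) / T
    W += eta * grad @ W

# Recovered outputs should match the true sources up to permutation/sign/scale
U = Xw @ W.T
corr = np.abs(np.corrcoef(U.T, S.T)[:N, N:])
print(corr.max(axis=1))   # each output correlates strongly with one source
```

The permutation and sign ambiguity visible in the correlation check is exactly the identifiability limit noted later for linear BSS.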
Furthermore, when dim(x) > dim(u), minimization of F can perform both dimensionality reduction and BSS. When the priors of W, V, and γ are flat, the free energy expectation (Equation (18)) approximately becomes F ≈ L_X + L_A + const. = −X[x; u|W, V, γ] + const. Here, γ is fixed, so that const. is a constant with respect to x, u, W, and V. The conditional entropy H[x|u, W] is ignored in the calculation because it is typically of a smaller order than L_X when Σ(γ) is not fine-tuned. As γ parameterizes the precision of the reconstruction error, it controls the ratio of PCA to ICA. Hence, as γ decreases to zero, the solution shifts from a PCA-like to an ICA-like solution.
Unlike the case with the scalar γ described above, if Σ(γ) is fine-tuned by a high-dimensional γ to minimize F, Σ = ⟨εε^T⟩_{p(x,ϕ)} is obtained. Under this condition, L_X is equal to H[x|u, W] up to a constant term, and thereby F = L_A + const. is obtained. This indicates that F consists only of the ICA part. These comparisons suggest that a low-dimensional γ is better for performing noise reduction than a high-dimensional γ.

Simulation and Results
The difference between the infomax principle and the FEP is illustrated by a simple simulation using a linear generative process and a linear neural network (Figure 3). For simplification, it is assumed that u quickly converges to u = Wx compared with the change of s (adiabatic approximation). For the results shown in Figure 3, s denotes two-dimensional hidden sources following an identical Laplace distribution with zero mean and unit variance; x denotes four-dimensional sensory inputs; u denotes two-dimensional neural outputs; z denotes four-dimensional background Gaussian noises following N[z; 0, Σ_z]; θ denotes a 4 × 2-dimensional mixing matrix; W is a 2 × 4-dimensional synaptic strength matrix for the bottom-up path; and V is a 4 × 2-dimensional synaptic strength matrix for the top-down path. The priors of W, V, and γ are supposed to be flat, as in Section 3. Sensory inputs are determined by x = θs + z, while neural outputs are determined by u = Wx. The reconstruction error is given by ε = x − Vu and is used to calculate H[x|ϕ] and L_A. The horizontal and vertical axes in the figure are the conditional entropy H[x|ϕ] (Equation (14)) and the free energy expectation F (Equation (18)), respectively. Simulations were conducted 100 times with randomly selected θ and Σ_z for each condition. For each simulation, 10^8 random sample points were generated, and the probability distributions were calculated using the histogram method.
First, when W is randomly chosen and V is defined by V = W T , both H[x|ϕ] and F are scattered (black circles in Figure 3) because neural outputs represent random mixtures of sources and noises.
Next, when W is optimized according to either Equation (27) or (28) under the constraint of V = W T , the neural outputs express the major principal components of the inputs, i.e., the network performs PCA (blue circles in Figure 3). This is the case when H[x|ϕ] is minimized. In contrast, when W, V, and Σ (γ) are optimized according to the FEP (see Equation (18)), the neural outputs represent the independent components that match the prior source distribution; i.e., the network performs BSS or ICA while reducing the reconstruction error (red circles in Figure 3). For linear generative processes, the minimization of F can reliably and accurately perform both dimensionality reduction and BSS because the outputs become independent of each other and match the prior belief if and only if the outputs represent true sources up to permutation and sign-flip. As the utilizable information consists of PCA and ICA cost functions (see Equation (21)), the maximization of X[x; ϕ] leads to a solution that is a compromise between the solutions for the infomax principle and the FEP. Interestingly, the infomax optimization (i.e., PCA) provides a W that makes F closer to zero than random states, which indicates that the infomax optimization contributes to the free energy minimization. Note that, for nonlinear systems, there are many different transformations that make the outputs independent of each other [57]. Hence, there is no guarantee that minimization of F can identify the true sources of nonlinear generative models.
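The qualitative gap between the random and PCA solutions (black vs. blue circles in Figure 3) can be reproduced with closed-form principal components instead of learning; the mixing matrix, noise level, and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
S = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (T, 2))   # 2-D Laplace sources
theta = rng.standard_normal((4, 2))                # 4 x 2 mixing matrix
Z = 0.1 * rng.standard_normal((T, 4))              # background Gaussian noise
X = S @ theta.T + Z                                # sensory inputs x = theta s + z

def recon_error(W):
    """Mean squared error of x reconstructed as V u with V = W^T and u = W x."""
    U = X @ W.T
    return np.mean(np.sum((X - U @ W) ** 2, axis=1))

# Random orthonormal W (cf. black circles in Figure 3)
W_rand, _ = np.linalg.qr(rng.standard_normal((4, 2)))
err_rand = recon_error(W_rand.T)

# PCA: rows of W span the two major principal components (cf. blue circles)
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
err_pca = recon_error(Vt[:2])

print(err_pca, err_rand)   # the PCA solution reconstructs x far better
```

The reconstruction error here is a proxy for the conditional entropy H[x|ϕ] on the horizontal axis of Figure 3; reproducing the red (FEP/ICA) points additionally requires optimizing W and V against the Laplace prior.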
In summary, the aims of the FEP and infomax principle are similar to each other. In particular, when both the sources and noises follow Gaussian distributions, their aims become the same. Conversely, the optimal synaptic weights under the FEP are different from those under the infomax principle when sources follow non-Gaussian distributions. Under this condition, the maximization of the utilizable information leads to a compromise solution between those for the FEP and the infomax principle.

Discussion
In this study, the FEP is linked with the infomax principle, PCA, and ICA. It is more likely that the purpose of a neural network in a biological system is to minimize the surprise of sensory inputs to realize better inference rather than maximize the amount of stored information. For example, the visual input captured by a video camera contributes to the stored information, but this amount of information is not equal to the amount of information available for inference. The surprise expectation represents the difference between actual and inferred observations; the free energy expectation provides the difference between recognition and generative models. Utilizable information is introduced to quantify the inference and generalization capability of sensory inputs. Using this approach, the free energy expectation can be explained as the gap between the information stored in the neural network and that available for inference.
To perform ICA based on the infomax principle, one needs to tune the nonlinearity of the neural outputs so that the derivative of the nonlinear I/O function matches the prior distribution. Conversely, under the FEP, ICA is straightforwardly derived from the KLD between the actual probability distribution and the prior distribution of u. In particular, in the absence of background noise and prior knowledge of the parameters and hyper-parameters, the free energy expectation is equivalent to the surprise expectation as well as to Amari's ICA cost function, which indicates that ICA is a subproblem of the FEP.
The variational free energy quantifies the gap between the actual probability and the generative model and is a straightforward extension of the cost functions for BSS in the sense that it comprises the cost function for PCA [49] and ICA [50] in some special cases. Apart from that, there are studies that use the gap between the actual probability and the product of the marginal distributions to perform BSS [58] or to evaluate the information loss [59,60]. While the relationship between the product of the marginal distributions and the generative model is non-trivial, the comparison would lead to a deeper understanding about how the information of the external world is encoded by the neural network. In the subsequent work, we would like to see how the FEP and the infomax principle are related to those approaches.
The FEP is a rigorous and promising theory from theoretical and engineering viewpoints because various learning rules are derived from the FEP [14,15]. However, to be a physiologically plausible theory of the brain, the FEP needs to satisfy certain physiological requirements. There are two major requirements: first, physiological evidence that shows the existence of learning or self-organizing processes under the FEP is required. The model structure under the FEP is consistent with the structure of cortical microcircuits [19]. Moreover, in vitro neural networks performing BSS reduce free energy [48]. It is known that the spontaneous prior activity of a visual area enables it to learn the properties of natural pictures [61]. These results suggest the physiological plausibility of the FEP. Nevertheless, further experiments and consideration of information-theoretical optimization under physiological constraints [62] are required to prove the existence of the FEP in the biological brain. Second, the update rule must be a biologically plausible local learning rule, i.e., synaptic strengths must be changed by signals from connected cells or widespread liquid factors. While the synaptic update rule for a discrete system is local [17], the current rule for a continuous system [14] is a non-local rule.
Recently developed biologically-plausible three-factor learning models in which Hebbian learning is mediated by a third modulatory factor [56,[63][64][65] may help reveal the neuronal mechanism underlying unconscious inference. Therefore, it is necessary to investigate how actual neural networks infer the dynamics placed in the background of the sensory input and whether this is consistent with the FEP (see also [66] for the relationship between the FEP and spike-timing dependent plasticity [67,68]). This may help develop a biologically plausible learning algorithm through which an actual neural network might develop its internal model. Characterization of information from a physical viewpoint may also help understand how the brain physically embodies the information [69,70]. In the subsequent work, we would like to investigate this relationship.
In summary, this study investigated the differences between two types of information: information stored in the neural network and information available for inference. It was demonstrated that free energy represents the gap between these two types of information. This result clarifies the difference between the FEP and related theories and can be utilized for understanding unconscious inference from a theoretical viewpoint.