A Technical Critique of Some Parts of the Free Energy Principle

We summarize the original formulation of the free energy principle and highlight some technical issues. We discuss how these issues affect related results involving generalised coordinates and, where appropriate, mention consequences for and reveal, up to now unacknowledged, differences from newer formulations of the free energy principle. In particular, we reveal that various definitions of the “Markov blanket” proposed in different works are not equivalent. We show that crucial steps in the free energy argument, which involve rewriting the equations of motion of systems with Markov blankets, are not generally correct without additional (previously unstated) assumptions. We prove by counterexamples that the original free energy lemma, when taken at face value, is wrong. We show further that this free energy lemma, when it does hold, implies the equality of variational density and ergodic conditional density. The interpretation in terms of Bayesian inference hinges on this point, and we hence conclude that it is not sufficiently justified. Additionally, we highlight that the variational densities presented in newer formulations of the free energy principle and lemma are parametrised by different variables than in older works, leading to a substantially different interpretation of the theory. Note that we only highlight some specific problems in the discussed publications. These problems do not rule out conclusively that the general ideas behind the free energy principle are worth pursuing.


Overview
In [1], it was argued that the internal coordinates of an ergodic random dynamical system with a Markov blanket necessarily appear to engage in active Bayesian inference. Here, we reproduce the argument supporting this interpretation in detail and highlight at which points it faces technical issues. In the course of our critique, we also mention issues of some closely related alternative arguments. In cases where our results have clear consequences for the more recent related publications [2,3], we also mention those. In particular, we point out a conceptual difference in these latter works that has not previously been acknowledged. However, our analysis thereof does not go beyond a few remarks. In an additional section, we discuss the effect of our argument on [4]. The logical structure of the present paper is depicted in Figure 1. We note that the technical issues presented here do not affect the validity of approaches where a (expected) free energy minimizing agent is assumed a priori, as presented in, e.g., [5]. None of [1][2][3][4] make this assumption; they instead aim to identify the conditions under which such agents will emerge within a given stochastic process. We criticize specific formal issues in the latter publications but leave open whether they can be fixed. We now briefly introduce the setting of [1] and then sketch the content of this paper. We now briefly introduce the setting of [1] and then sketch the content of this paper.

Interpretation as Bayesian inference
Step 2 Step 3 Step 3 [2] Step 3, general Step 4 Step 5 Step 6 Step 7 [3] Figure 1. Argument visualization. Numbers labelling edges indicate corresponding steps in this paper. Struck out edges indicate implications that we prove incorrect. The main argument in [1] takes the left path. The box in the top right indicates the relations between Conditions 1 to 3 and their role in [3]. Merged edges indicate a logical AND combination of the parent nodes.
The starting point is a random dynamical system whose evolution is governed by the stochastic differential equation:ẋ where the system state x and vector field f (x) are multi-dimensional and ω is a Gaussian noise term. There is an additional assumption that the system is ergodic, such that the steady state probability density p * (x) is well defined (In the original paper, the ergodic density is simply denoted p(x). We here add a star to highlight that it is a time independent probability density.). In this case, − ln p * (x) plays the role of a potential function, in the sense that f can be formulated in terms of its gradients [6,7]. It is then assumed that there is a coordinate system x = (ψ, s, a, λ) with ψ = (ψ 1 , ..., ψ n ψ ), s = (s 1 , ..., s n s ), a = (a 1 , ..., a n a ), and λ = (λ 1 , ..., λ n λ ), referred to as external, sensory, active, and internal coordinates (these are called "states" in [1]), respectively, such that the following condition holds: Condition 1. The function f (x) can be written as: This particular structure is described as "[formalizing] the dependencies implied by the Markov blanket" [1]. In contrast, more recent works [2,3] formulated the Markov blanket in terms of the statistical dependencies of the ergodic density p * (x) = p * (ψ, s, a, λ). Specifically, the following condition is presented: The ergodic density factorises as: p * (ψ, s, a, λ) = p * (ψ|s, a)p * (λ|s, a)p * (s, a). ( In other words, the internal and external coordinates are independently distributed when conditioned on the sensory and active coordinates. This means we have two different formal expressions of what constitutes a Markov blanket in these publications, and their relationship has not previously been established.
Taking Condition 1 to hold, the argument of [1] then proceeds along the following steps: Step 2 Rewrite the vector field f (ψ, s, a, λ) describing the dynamics of the system in terms of the gradient of the negative logarithm of the ergodic density p * (ψ, s, a, λ) of that system.
Step 3 Rewrite the components f λ (s, a, λ) and f a (s, a, λ) of the vector field f (ψ, s, a, λ) in terms of only partial gradients of the negative logarithm of p * (ψ, s, a, λ).
Step 4 Assert (in the free energy lemma) the existence of a density q(ψ|λ) over the external coordinates ψ parameterized by the internal coordinates λ and that f (ψ, s, a, λ) can again be rewritten, this time in terms of a free energy depending on q(Ψ|λ) (here, and whenever it would otherwise be ambiguous, we use a capitalized Ψ to indicate full distributions, rather than the probability density for a specific value of ψ).
Step 5 Claim that the equivalence of the equations of motion in Step 3 and Step 4 implies that certain partial gradients of the KL divergence between q(Ψ|λ) and the conditional ergodic density p * (Ψ|s, a, λ) must vanish.
Step 7 Interpret: • p * (Ψ|s, a, λ) as a posterior over external coordinates given particular values of sensor, active, and internal coordinates, • q(Ψ|λ) as encoding Bayesian beliefs about the external coordinates by the internal coordinates, and • their equality as the internal coordinates appearing to "solve the problem of Bayesian inference".
In the present paper, we make the following main observations: • The re-expression of Equation (1)  The latter [4] presents an argument almost identical to the one in the original [1]. In Section 8, we discuss how our observations apply to this publication.

Expression via the Gradient of the Ergodic Density
Here, we introduce the expression of the system's dynamics Equation (1) in the form used for the free energy lemma (Lemma 2.1 in [1]). This form expresses the dynamics of the internal and active coordinates of the given ergodic random dynamical system in terms of the gradient of the ergodic density p * (x). In accordance with the results of [7], f (x) is rewritten as (see Equation (2.5) in [1]): where Γ is the diffusion matrix, which we will take to be block diagonal (in [1], and later work such as [2], Γ is taken to be proportional to the identity matrix), and R is an antisymmetric matrix, defined through the relation: with Here, and in all of [1][2][3][4], both Γ and R are assumed constant. We emphasise here that, for general nonlinear models, these matrices can vary with the coordinates and Equation (5) holds only approximately [8,9] (the exact conditions under which these matrices can be chosen to be constant can be found in [9,10] and, for the discrete state case, [11]). Moreover, Equation (4) is derived in the literature under the explicit assumption that the fluctuations ω be Gaussian and Markov [6,7]. For the counterexamples we present here, we restrict ourselves to the class of Ornstein-Uhlenbeck processes, for which R and Γ are always constant, and the ergodic density p * (x) = p * (ψ, s, a, λ) is necessarily a multivariate Gaussian with zero mean. Specifically, following [7], where (ψ, s, a, λ) is a row vector and Z is a suitable normalisation constant. From Equation (4), it can be seen that, though we emphasise here that strict relations between M and U can only be made because of the assumption that Γ and R are coordinate independent [12]. This concludes Step 2.
Before moving on to Step 3, we note that, under the assumptions implicit in Step 2, we can express Conditions 1 and 2 in terms of the matrices M and U (in the nonlinear case, these matrices can still be defined in terms of the derivatives of the force vector field and potential, respectively; however, they will be generally coordinate-dependent, even when Γ and R are not [8]). Firstly, since it effectively states that with M αβ a block sub-matrix of M in general. Secondly, because of the multivariate Gaussian nature of p * (ψ, s, a, λ), the dependencies of conditional distributions are encoded in the inverse U of the covariance matrix; we therefore have that: where U αβ is a block sub-matrix of U. These implications bring us to our first observation: Neither Condition 1 (the vector field dependency structure) nor Condition 2 (conditional independence in the ergodic distribution) imply the other: Condition 1 Condition 2.
Proof. In Appendix A, we provide direct counterexamples, using the equivalent constraints on the matrices M and U in Equations (9) and (10), for the implication in either direction. That is, there exists a system obeying Condition 1 that does not obey Condition 2 (proving Equation (11)), and there exists one obeying Condition 2 that does not obey Condition 1 (proving Equation (12)).
Henceforth, unless otherwise stated, we will assume both Conditions 1 and 2. Any implications that fail to hold in this special case cannot hold generally.

Re-Expression Using only Partial Gradients
For Step 3, we focus on the components f λ = ( f λ 1 , ..., f n λ ) and f a = ( f a 1 , ..., f n a ) of f . Without loss of generality, we can rewrite them from Equation (4) as: where Γ nm (R nm ) is the block of Γ (R) connecting derivatives with respect to the m coordinates to the time derivatives of the n coordinates. The expectation value with respect to p * (ψ|s, a, λ) leaves the left-hand side of these equations unchanged. A few manipulations ( [2] cf. Equation (12.14), p. 129) reveal that, on the right-hand side, this leads to the ergodic density p * (ψ, s, a, λ) being replaced by the marginalised ergodic density p * (s, a, λ) so that we get: f λ (s, a, λ) = R λψ · ∇ ψ + R λs · ∇ s + (Γ λλ + R λλ ) · ∇ λ +R λa · ∇ a ) ln p * (s, a, λ).
Since ∇ ψ ln p * (s, a, λ) = 0, the terms involving ∇ ψ drop out: We are not aware of how to further simplify this equation without additional assumptions. However, in (Equations (2.5) and (2.6) of [1]), all of the off-diagonal terms are implicitly assumed to vanish, i.e., Equation (4) is equated with: This equation is the result of Step 3.
More recently (Appendix B of [2]), a more detailed discussion of Equation (4) was presented, where it was claimed that Condition 1 implies Condition 2 (cf. our Observation 1) along with the following simplification of Equations (17) and (18) However, Equations (21) and (22) are still provably less general than Equations (13) and (14), even when both Conditions 1 and 2 are satisfied.
Proof. By counterexample, see Appendix B. There, we show explicitly that a model satisfying the above assumptions does not satisfy the equations in question.
In order to arrive at Equations (21) and (22) from Equations (17) and (18) in general, one must remove the offending "solenoidal flow" terms by fiat. That is, one assumes R as = R λs = 0. In [2], Equation (12.4), the following, even stronger, condition was assumed as an alternative starting point (along with Condition 2): Condition 3. The blocks of the R matrix appearing in Equation (4) coupling (s, a) coordinates to λ and ψ coordinates and ψ coordinates to λ coordinates vanish, i.e., This is claimed to imply M ψλ = M λψ = 0, but not the full Condition 1. However, in [3], both Conditions 1 and 3 were assumed (along with R as = 0). This prompts our next observation. Observation 3. In a system satisfying both Conditions 1 and 3, the internal coordinates cannot be directly influenced by the sensory coordinates: f λ (s, a, λ) = f λ (a, λ), and the external coordinates cannot be directly influenced by the active coordinates: f ψ (ψ, s, a) = f ψ (ψ, s). (5), it follows that:

Proof. From Equation
with the inverse replaced by a pseudoinverse if Γ − R is not invertible. Therefore, if Γ αβ = δ αβ Γ αα and R αβ = δ αβ R αα for blocks of coordinates labelled by α and β, then: and Condition 3 implies that only the nonzero blocks of R are R ψψ , R ss , R sa , R as , R aa , and R λλ , and Γ is assumed to be block diagonal. As noted in Equation (9), Condition 1 requires that M aψ = M λψ = M sλ = M ψλ = 0. Through Equation (25), these together imply that M λs = M ψa = 0, and hence that: as shown.
In this case, the four sets of coordinates interact in a chain, and it is questionable whether the s and a coordinates can be meaningfully interpreted, respectively, as sensory inputs to the internal coordinates or their boundary-mediated influence on the external coordinates.

Free Energy Lemma
The relation of the dynamics of the internal coordinates to Bayesian beliefs is made by introducing a density (called the variational density) q(Ψ|λ) that is then interpreted as encoding a Bayesian belief. It is parameterized by the internal coordinates λ and claimed to be "arbitrary". We take this "at face value" and consider q(Ψ|λ) to be parameterized only by λ and, therefore, to be independent of (s, a). (We note that there is a convention in the literature on variational Bayesian inference, e.g., in [13], to drop the observed variables/data in the variational density. It is possible that in [1], (s, a) was seen as observed variables and dropped from the variational density q(Ψ|λ) as in this convention. However, the reason that dropping the observed variables is justified in the established convention is that those observed variables are fixed throughout the minimization of the variational free energy and the parameters of the variational density do not influence the observed data in any way. In other words, the variational density is optimized for a single data point. In [1], the data point was continuously changing and partially doing so with dependence on the parameter λ asȧ = f a (s, a, λ). These differences and their consequences are non-trivial and beyond the scope of this paper, so we assume that the variational density does not depend on (s, a).) If q(Ψ|λ) is allowed to depend on (s, a), Observation 4 does not apply, and the free energy lemma is made trivially true by setting q(ψ|s, a, λ) := p * (ψ|s, a, λ). The existence of the variational density q(Ψ|λ) is asserted by the free energy lemma (see Lemma 2.1 in [1]) (Explicitly, the free energy lemma asserts the existence of a free energy F(s, a, λ) in terms of which f (ψ, s, a, λ) can be expressed and not the existence of q(Ψ|λ). However, since the free energy is defined as a functional of q(Ψ|λ), it exists if and only if a suitable q(Ψ|λ) exists.).
More precisely, the free energy lemma (and Step 4) asserts that for every ergodic density (equivalently as expressed in [1], for every Gibbs energy G(x) := − ln p * (ψ, s, a, λ)) p * (ψ, s, a, λ) of a system obeying Equations (19) and (20), there is a free energy F(s, a, λ), defined as: in terms of the "posterior density" p * (Ψ|s, a, λ) (here, we keep the conditioning argument λ, as in [1], and do not explicitly assume Condition 2, though our conclusions are unaffected by it), such that Equations (19) and (20) can be rewritten as: It is worth considering what a proof of the free energy lemma could look like. A proof of the existence of a free energy (and therefore of the free energy lemma) would need to show that, for every system satisfying the given assumptions, there always exists a q(Ψ|λ) such that the right-hand sides of Equations (29) and (30) are equal to the right-hand sides of Equations (19) and (20). Expanding Equations (29) and (30) using (28) leads to: For the equality of the right-hand sides to those of Equations (19) and (20), we need: In other words, these equations say that the free energy lemma holds if any of the following three conditions (of strictly increasing strengths) are given: 1.
There is a q(Ψ|λ) such that the partial gradients ∇ a and ∇ λ of the KL divergence between the variational density and the conditional ergodic density are elements of the nullspaces of (Γ + R) aa and (Γ + R) λλ , respectively.

2.
There is a q(Ψ|λ) such that the gradients of the KL divergence to p * (Ψ|s, a, λ) are equal to the nullvector: Then, they are always elements of the nullspaces of (Γ + R) aa and (Γ + R) λλ , respectively.
The free energy lemma can then be proven by showing that one of these three cases follows from the conditions of the lemma. However, no attempt was made in [1] to establish this. Instead, the given proof discusses the purported consequences of the existence of a suitable q(Ψ|λ). These will be discussed in Steps 5 and 6.
Even if the free energy lemma does not hold for systems obeying Equations (19) and (20), one might expect that the systems instead only satisfy the more general Equations (21) and (22) or the most general Equations (17) and (18). For these systems, the free energy lemma would require that there is a q(Ψ|λ) such that: or: hold, respectively. However, we find this not to be the case in general.
Proof. In Appendix C, we derive a set of conditions on the R and U matrices and on the putative variational density q(Ψ|λ), which follow from each of the pairs of equations in Cases (i-iii). We show that, in general, each pair leads to a contradiction, and in each case, we provide a counterexample that falls into the according system class.
Before proceeding, we note that later works presented an alternative version of the free energy lemma, where the conditioning argument of q(Ψ|λ) was replaced by the most likely value of λ conditional on the (s, a) coordinates [2,3]. We here concern ourselves with the version apparent in [1], where q(Ψ|λ) is parametrised by the internal states themselves, but we briefly comment on the interpretation of the alternative approach in Step 7.

Vanishing Gradients
As mentioned in Step 4, the proof of the free energy lemma in [1] only discussed its consequences. The first proposed consequence is that expressing the vector field in terms of a free energy as in Equations (29) and (30) "requires" that the gradients with respect to a and λ of the KL divergence vanish, i.e., that Equations (35) and (36) hold.
We mentioned in Step 4 that the implication in the opposite direction holds. This can be seen from Equations (33) and (34). However, if the nullspace of (Γ + R) aa or (Γ + R) λλ is non-trivial, then the gradient may be a non-zero element of this subspace and Equations (29) and (30) will still hold. In that case, the vanishing gradients would not be necessary for the free energy lemma.
The conditions under which a non-trivial nullspace exists were discussed in [7]. In short, the nullspace is guaranteed to be trivial in the special case where Γ is positive definite. Whether or not ergodic systems with a Markov blanket can ever admit a non-trivial nullspace, and hence divergences in Equations (31) and (32) with non-vanishing gradients, is not immediately clear. However, in order to establish the necessity of Equations (35) and (36), this remains to be proven.

Equality of Q(Ψ|λ) and P * (Ψ|s, a, λ)
The proof of the free energy lemma in [1] also proposes that the vanishing of the gradients of the KL divergence, of the variational density q(Ψ|λ) from the conditional ergodic density p * (Ψ|s, a, λ), implies the equality of these densities. We mentioned in Equations (5) that the implication in the opposite direction holds. This can also be seen from Equations (33) and (34). Concerning the implication in the direction proposed by [1], let us now assume that for a given system of Equations (19) and (20) holds, a variational density q(Ψ|λ) does exist, and the gradients of the KL divergence of the variational and ergodic densities vanish, i.e., Equations (35) and (36) hold. Then, consider the argument by [1] in this direct quote (comments in square brackets by us): "However, Equation (2.6) [Equations (19) and (20) above] requires the gradients of the divergence to be zero [Equations (35) and (36)], which means the divergence must be minimized with respect to internal states. This means that the variational and posterior densities must be equal: In other words, the flow of internal and active states minimizes free energy, rendering the variational density equivalent to the posterior density over external states." The first problem in the above quote is that the minimization of the divergence does not follow from the vanishing gradients. On the contrary, since Equations (35) and (36) must hold for all (s, a, λ), the KL divergence: cannot depend on (λ, a); it therefore has no extremum (and thus no minimum) with respect to either of these coordinates.
The second problem pertains to the identification of the two distributions at a minimum. In general, if we try to find the minimum of a KL divergence between a given probability density p 1 (Y) and a family of densities p 2 (Y|θ) parameterized by θ, then the lowest possible value of zero is achieved only if there is a parameter θ 1 such that p 2 (Y|θ 1 ) = p 1 (Y). If there is no such θ 1 , then the minimum value will be larger than zero. Therefore, even if the divergence were minimized, it would not need to be zero. More generally, the divergence K(s) need not be zero for any value of s.
There is therefore no satisfactory reason given why the variational density q(Ψ|λ) and the posterior density p * (Ψ|s, a, λ) should be equal or have low KL divergence. In fact, they need not be (Note that, since any q(Ψ|λ) that does not depend on (s, a) is an element of the set of those that do, Observation 5 remains true for the case where we allow this dependence. In that case, the free energy lemma holds because we can set q(Ψ|s, a, λ) := p * (Ψ|s, a, λ), and thus, a q exists for which the densities are actually equal. However, the claim here is that for every q that obeys the conditions in Observation 5, we must have equality.). then there is no c ≥ 0 for which it can be guaranteed that: In particular, it does not follow from these conditions that:  (22), and the nullspace of the associated Γ + R is trivial. We identify a set of possible q(Ψ|λ) satisfying Equations (29) and (30), which implies that the gradients of the KL divergence between those q(Ψ|λ) and p * (Ψ|s, a, λ) vanish, i.e., Equations (35) and (36) hold. We then demonstrate that for the q(Ψ|λ) in this set, the value of the KL divergence to p * (Ψ|s, a, λ) can be arbitrarily large.

Interpretation
Finally, we turn our attention to the interpretation in terms of Bayesian inference, i.e., Step 7. We again quote directly from [1]: Because (by Gibbs inequality) this divergence [ D KL [q(ψ|λ)||p * (ψ|s, a, λ)]] cannot be less than zero, the internal flow will appear to have minimized the divergence between the variational and posterior density. In other words, the internal states will appear to have solved the problem of Bayesian inference by encoding posterior beliefs about hidden (external) states, under a generative model provided by the Gibbs energy.
We showed that, in general, there is no suitable variational density that is only parameterized by the internal coordinate λ. We then showed that, even if there is a suitable variational density (including those parameterized by all of (s, a, λ)), it can be arbitrarily different from the posterior density. Since the arguments for the internal flow appearing to minimize the divergence between variational and posterior density are therefore incorrect, there is no reason why the internal states should appear to have solved the problem of Bayesian inference.
As mentioned in Step 4, some newer works (e.g., [2,3]) formulated a different free energy principle, where the variational density of beliefs is parametrised not by the internal coordinates λ, but byλ(s, a) = arg max λ p * (λ|s, a), the most likely value of the internal coordinates given the sensory and active ones. In this case, Observations 4 and 5 do not apply. However, the new parametersλ(s, a) are strictly a function of the sensory and active coordinates. This means we have a Markov chain (with capitalisations indicating random variables associated with the corresponding lower case coordinates (or functions of coordinates)) Λ → (S, A) →Λ and, by the data processing inequality [14], the mutual information between the both sensory and active coordinates and the belief parameterλ upper bounds that are between the internal coordinates and the belief parameter. It is therefore not clear to what extent the internal coordinates λ, rather than the active and sensory coordinates (s, a) themselves, can be said to be encoding beliefs about the external coordinates. Note also that, on any given trajectory, unless the distribution p * (λ|s, a) is sufficiently peaked and unimodal, the internal coordinates are not guaranteed to spend most of their time close to their most likely conditional value, and (by definition if Condition 2 holds) they will not be better predictors of the external coordinates than those in the Markov blanket.
Generally, λ =λ, andλ is the solution to an optimization problem that is assumed to be solved in these later works. Using this optimized variable to parametrise beliefs is therefore a considerable departure from [1]. Contrary to the impression created by the way it was referenced in [2,3], the older theory in [1] should be clearly distinguished from the newer ones in these more recent papers.

Consequences for Friston, K. et al. 2014
Reference [4] argued for the same interpretation as [1], but there were some differences in the argument.
The differences were the following: • In [4], Equation (1) was formulated for "generalized states", which we refer to here as generalized coordinates. This means that the variable x is replaced by a multidimensional variable denotedx = (x, x , x , ...). • The Markov blanket structure was not explicitly defined via Equation (2). Formally, it was introduced directly (see [4] Equation (10)) in a less general form corresponding to Equations (19) and (20) (at the same time, [1] is referenced in connection to the Markov blanket so there seems to be no intention to replace the original definition with the stronger one). Therefore, our observations concerning Steps 2 to 4 are not directly relevant to this paper.
• The internal coordinate λ was renamed to r, and the role of matrix R was played by the matrix −Q. • The proof of the free energy lemma given in [4] was different. It (implicitly) suggested setting the variational density equal to the ergodic conditional posterior. • The proof of the free energy lemma no longer contained the proposition that the gradient of the KL divergence of the variational density and the ergodic conditional density vanish, i.e., Step 5.

•
The proof also no longer contained the claim that the vanishing gradients of the KL divergence of the variational density and the ergodic conditional density imply the equality of those densities, i.e., Step 6 was not present.
The interpretation in terms of Bayesian inference was unchanged and still relied on the equality of the variational and the ergodic conditional density.
Since there were no explicit generalized coordinate versions of Steps 2, 3, 5 and 6 in [4], we do not discuss those steps here. We only disprove the free energy lemma and the claim that when the free energy lemma holds, the variational and ergodic conditional density become equal. For this, we present a way to translate the counterexamples used in Observations 4 and 5 into counterexamples in generalized coordinates. The interpretation in terms of Bayesian inference given in [4] is therefore equally as unjustified as the one in [1].
The free energy lemma then requires that there exists q(Ψ|r) such that the KL divergence between p * (Ψ|s,ã,r) vanishes. Without going into further details of the difference between the proof in [4] and that in [1], we can prove the former wrong by translating the counterexample used for the latter into generalised coordinates.

Observation 6.
There is a general way to translate a system in ordinary coordinates into a system of generalised coordinates that corresponds to an infinite number of independent copies of the original system. This means all properties of the original system (e.g., linearity, ergodicity, the Gaussian and Markovian property of the noise, Conditions 1 and 2, the properties of Γ, R, U) are preserved during this translation.

Proof. By construction, see Appendix E.
This implies that the counterexamples used in proving Observations 4 and 5 directly translate to the setting of the generalised coordinates. The free energy lemma is therefore also wrong for generalised coordinates, and the variational density q(Ψ|r) is not "ensured" [4] to be equal to the conditional ergodic density p * (Ψ|s,ã,r).

Conclusions
We find that the two different Markov blanket conditions proposed in [1][2][3] are independent of each other. We then show that under both of those Markov blanket conditions, among the six steps contained in the argument in [1], three do not hold independently of each other. We also show that fixing the second of those steps (Step 3) does not provide a valid alternative. The line of reasoning of [1] therefore does not support its claim that the internal coordinates of a Markov blanket "appear to have solved the problem of Bayesian inference by encoding posterior beliefs about hidden (external) [coordinates], ...". We also show that using generalised coordinates as in [4] does not remedy the situation. Additionally, we identify a technical error in [2] and an interpretational issue resulting from possibly too strong assumptions (both Conditions 1 and 3) in [3]. We also highlight that the latter publications both argued that it is the most likely internal coordinates given sensory and active coordinates that encode posterior beliefs about external states instead of the internal coordinates themselves. The resulting free energy principle and lemma are therefore a different proposal. This is not subject to our technical critique.  Acknowledgments: All authors are grateful to Karl Friston and Thomas Parr for constructive feedback on an earlier version of this work. We also want to thank Danijar Hafner for pointing us to [9]. Martin Biehl wants to thank Yen Yu for helpful discussions on generalized coordinates.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Appendix A. Counterexamples for Observation 1
Consider a four-dimensional linear system obeying Equation (1) for which there are coordinates x = (ψ, s, a, λ) with n ψ = n s = n a = n λ = 1 and: with the parametrisation: From Equation (9), it is clear that the system obeys Condition 1 if m 3 = 0. In this case, taking Γ to be the identity matrix, it is possible to show that: For fixed, finite m 2 , this is zero only for a few discrete values of m 1 , such as m 1 = m 2 − 2; that it is generically non-zero proves Equation (11). As a concrete example, the following: has: 137(a 2 + s 2 ) + 118(ψ 2 + λ 2 ) +127(ψs + aλ) + 93(ψa + sλ) + 206as − 36ψλ , which does not conditionally factorise.
Taking the same parametrisation as in Equation (A2) and fixing m 1 = m 2 = − 1 /2, we can search for a non-zero value of m 3 that leads to U ψλ = 0 (equivalent to Condition 2 through Equation (10)). We find such a value in the real root c −0.08 of the quintic equation 8c 5 − 4c 4 − 6c 3 + 31c 2 + 40c + 3 = 0; that is, with: which has a non-zero determinant (i.e., the dynamics is ergodic) and an ergodic density satisfying Condition 2. This proves Equation (12).

Appendix B. Counterexample for Step 3
Here, we consider a linear system, as in the previous Appendix. We again assume Γ equal to the identity matrix and choose a force matrix of the form: which explicitly satisfies Condition 1 and has full rank such that the system is ergodic. Using Equation (5), this leads to: which shows that all entries or R that can be non-zero for an anti-symmetric matrix are non-zero. For the marginal ergodic density, we find: The difference between the right-hand sides of Equations (17) and (19) is: which shows that Equation (19) is wrong in this example and therefore not generally equivalent to Equation (17). Similarly, computing the difference between the right-hand sides of Equations (18) and (20), one finds: and hence, Equation (20) is also incorrect in general. Performing the same comparison for the difference between the general expression in Equations (17) and (18) and the expressions taken from [2], one finds: for the difference between the right-hand sides of Equations (17) and (21), and: for the difference between the right-hand sides of Equations (18) and (22). Therefore, Equations (21) and (22) are also incorrect in general, even when Conditions 1 and 2 both hold.

Appendix C. Counterexamples for Step 4
We saw in Appendix B that Equations (19) and (20) are not generally equivalent to Equation (4), even when Conditions 1 and 2 hold simultaneously. We now show that if we instead use Equations (17) and (18), which are generally equivalent to Equation (4), the free energy lemma does not hold in general.
The original free energy lemma requires that (see Equations (31) and (32)): Replacing the partial gradient in Equations (29) and (30) with the full gradient and including the entire matrix (Γ + R) lead to the corresponding requirement for the more general case: (R as · ∇ s + (Γ aa + R aa ) · ∇ a + R aλ · ∇ λ ) D KL [q(Ψ|λ)||p * (Ψ|s, a, λ)] = 0 (A21) Similarly, the version based on the equations taken from [2] implies: Using the rules of Gaussian integration, we can write the logarithm of the conditional ergodic density as: with C a constant (and remembering each of ψ, s, a, and λ is a vector of coordinates in general). We can then expand the derivatives of the KL divergence to express them in terms of the coordinates: ∇ a D KL [q(Ψ|λ)||p * (ψ|s, a, λ)] = − dψ q(ψ|λ)∇ a ln p * (ψ|s, a, λ) with g(ψ) q(Ψ|λ) := dψ q(ψ|λ)g(ψ) and H the Shannon entropy. Substituting Equations (A27) and (A28) into Equations (A19) and (A20) leads to: and: Since these must hold for all values of the coordinates, they put strong requirements on the U and R matrices. Specifically, In other words, since U ψψ and Γ aa must be nonzero for the dynamics to be ergodic, it must be that U ψa = 0 (This is equivalent to p * (Ψ|s, a, λ) = p * (Ψ|s, λ). Therefore, if Condition 2 also holds, we must have p * (Ψ|s, a, λ) = p * (Ψ|s) in order for there to be a suitable q(Ψ|λ).). Specifically, consider the system specified by the force matrix: Here, M is full rank, so the system is ergodic; clearly, it also satisfies Condition 1 due to the structure of M. Since R as = R aλ = R λs = 0, it obeys Equations (19) and (20), and since Q ψλ = 0, it also obeys Condition 2. Additionally, we find U ψa = − 4 /17, which is a contradiction.
If we now perform the same procedure for Equations (A23) and (A24), we arrive at the following conditions on the gradient of the variational density: R aλ ∇ λ ψ q(Ψ|λ) = − (R aa + Γ aa )U aψ + R aλ U λψ U −1 ψψ . and: Even when Condition 2 holds and U ψλ = 0, these will be inconsistent in general. As a specific counterexample, take the system with force matrix: with corresponding: and: This model is ergodic (full rank U), and it satisfies both Conditions 1 and 2. Moreover, the forces satisfy Equations (21) and (22). However, substituting the relevant elements of the U and R matrices into Equation (A44), we find: but doing the same for Equation (A45) gives: which is a contradiction.

Appendix D. Counterexample for Step 6
Here, we provide an example system for which Conditions 1 and 2, as well as Steps 2 to 5 are valid, but Step 6 fails. We use a system with: (A52) This system is ergodic, satisfies Condition 1, and as we will will see, satisfies Equations (19) and (20)  and from Equation (8): which means that Condition 2 is also satisfied.
This leads to the ergodic density: p * (ψ, s, a, λ) = √ 5 16π 2 e − 1 2 (ψ 2 −ψs+s 2 −sa+a 2 −aλ+λ 2 ) (A55) which can be used to check that Equations (19) and (20) hold for this example. The conditional ergodic density is: (A56) If we now define q(ψ|λ) = q(ψ) = exp(−(ψ − µ) 2 /2)/ √ 2π as a Gaussian distribution with mean µ and variance one, we can compute the KL divergence to get: Clearly, for this choice of q(ψ|λ), the gradients with respect to a and λ of the KL divergence vanish everywhere (Equations (35) and (36) hold). This also means we can express f a , f λ in terms of a free energy, i.e., the free energy lemma holds for this system. However, for any proposed bound c ≥ 0 on the KL divergence, there is a value of s for which it is exceeded, whatever the choice of µ. Moreover, we can choose a µ such that the KL divergence is larger than any given c, even when s = 0.