On Predictive Planning and Counterfactual Learning in Active Inference

Given the rapid advancement of artificial intelligence, understanding the foundations of intelligent behaviour is increasingly important. Active inference, regarded as a general theory of behaviour, offers a principled approach to probing the basis of sophistication in planning and decision-making. This paper examines two decision-making schemes in active inference based on “planning” and “learning from experience”. We also introduce a mixed model that navigates the data-complexity trade-off between these strategies, leveraging the strengths of both to facilitate balanced decision-making. We evaluate our proposed model in a challenging grid-world scenario that requires adaptability from the agent. Additionally, our model provides the opportunity to analyse the evolution of various parameters, offering valuable insights and contributing to an explainable framework for intelligent decision-making.


Introduction
Defining and thereby separating the intelligent "agent" from its embodied "environment", which then provides feedback to the agent, is crucial to modelling intelligent behaviour. Popular approaches, like reinforcement learning (RL), rely heavily on such models containing agent-environment loops, which reduces the problem to agent(s) trying to maximise reward in a given uncertain environment Sutton and Barto [2018].
Active inference has emerged in neuroscience as a biologically plausible framework Friston [2010], which adopts a different approach to modelling intelligent behaviour compared to other contemporary methods like RL. In the active inference framework, an agent accumulates and maximises the model evidence during its lifetime to perceive, learn, and make decisions Da Costa et al. [2020], Sajid et al. [2021], Millidge et al. [2020]. However, maximising the model evidence becomes challenging when the agent encounters a highly 'entropic' observation (i.e. an unexpected observation) with respect to the agent's generative (world) model Da Costa et al. [2020], Sajid et al. [2021], Millidge et al. [2020]. This seemingly intractable objective of maximising model evidence (or minimising the entropy of encountered observations) is achievable by minimising an upper bound on the entropy of observations, called variational free energy Da Costa et al. [2020], Sajid et al. [2021]. Given this general foundation, active inference Friston et al. [2017] offers excellent flexibility in defining the generative model structure for a given problem and has attracted much attention in various domains Kuchling et al. [2020], Deane et al. [2020].
In this work, we develop an efficient decision-making scheme based on active inference by combining 'planning' and 'learning from experience'. After a general introduction to generative world models in the next section, we take a closer look at the decision-making aspect of active inference. Then, we summarise two dominant approaches in the active inference literature: the first based on planning (Section 2.3.1), and the second based on counterfactual learning (cf. Section 2.3.2). We compare the computational complexity and data efficiency (cf. Section 3.2) of these two existing active inference schemes and propose a mixed or hybrid model that balances these two complementary schemes (Section 3.3). Our proposed hybrid model not only performs well in an environment that demands adaptability (Section 3.5) but also provides insights regarding the explainability of decision-making using model parameters (Section 4.1).

arXiv:2403.12417v1 [cs.AI] 19 Mar 2024

Agent-environment loop in active inference
Generative models are central to establishing the agent-environment loop in an active inference model. The agent is assumed to hold a scaled-down model of the external world that enables the agent to predict the external dynamics and future observations. The agent can then use its available actions to pursue future outcomes, ensuring survival. In this paper, we use a generative model based on the partially observable Markov decision process (POMDP) Kaelbling et al. [1998]. POMDPs are a general case of Markov decision processes (MDPs), which are controllable Markov chains apt for modelling stochastic systems in a discrete state space. In the following section, we provide the specific details of a POMDP-based generative model.

POMDP-based generative models
In active inference, agents learn the generative model about external states and optimise their decisions by minimising variational free energy. The POMDP is a universal framework to model discrete state-space environments, where the likelihood and state transition are expressed as tractable categorical distributions. Thus, we adopted the POMDP as our agent's generative model. The POMDP-based generative model is formally defined as a tuple of finite sets $(S, O, T, U, B, A, D, E)$ such that:
• $s_t \in S$: states, where $s_1$ is a given initial state.
• $o_t \in O$: observations, where $o_t = s_t$ in the fully observable setting, and $o_t = f(s_t)$ in a partially observable setting.
• $T \in \mathbb{N}^{+}$: a finite time horizon available per episode.
• $B$: encodes one-step transition dynamics, such that $P(s_t \mid s_{t-1}, u_{t-1}, B)$ is the probability that action $u_{t-1}$ taken at state $s_{t-1}$ at time $t-1$ results in $s_t$ at time $t$.
• $A$: encodes the likelihood distribution, $P(o_t \mid s_t, A)$, for the partially observable setting.
• $D$: prior about the state ($s$) at the starting time point, used for the Bayesian inference of the state ($s$) at time $t = 1$.
• $E$: prior about action selection, used to take action in the simulations at time $t = 1$.
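The tuple above can be sketched in code as a handful of arrays, each holding categorical distributions. This is a minimal illustration rather than the authors' implementation: the sizes, the random transition matrix, the uniform priors and the helper names `sample` and `rollout` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_states, n_obs, n_actions = 4, 4, 2

# A[o, s] = P(o | s). The identity matrix corresponds to the fully observable case (O = S).
A = np.eye(n_obs, n_states)

# B[s_next, s_prev, u] = P(s_next | s_prev, u). Random values, normalised so that
# every column is a valid categorical distribution over the next state.
B = rng.random((n_states, n_states, n_actions))
B /= B.sum(axis=0, keepdims=True)

# D[s] = prior over the initial state; E[u] = prior over actions (both uniform here).
D = np.full(n_states, 1.0 / n_states)
E = np.full(n_actions, 1.0 / n_actions)


def sample(p):
    """Draw one index from the categorical distribution p."""
    return int(rng.choice(len(p), p=p))


def rollout(T):
    """Generate T steps of (state, observation, action) from the model."""
    s = sample(D)
    trajectory = []
    for _ in range(T):
        o = sample(A[:, s])      # observation generated by the hidden state
        u = sample(E)            # action drawn from the action prior
        trajectory.append((s, o, u))
        s = sample(B[:, s, u])   # one-step transition
    return trajectory
```

With `A` set to the identity, every observation returned by `rollout` equals the underlying state, which is the fully observable setting used in the simulations.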
In the POMDP, hidden states ($s$) generate observations ($o$) through the likelihood mapping ($A$) in the form of a categorical distribution, $P(o_t \mid s_t, A) = \mathrm{Cat}(A)$. The states $s$ are determined by the transition matrix ($B$) given the agent's action ($u$), $P(s_t \mid s_{t-1}, u_{t-1}, B) = \mathrm{Cat}(B)$. Together with the priors $D$ and $E$, these distributions specify the generative model in question. Under the mean-field approximation, the approximate posterior distribution (concerning hidden states $s$) factorises, with the posterior beliefs about states and parameters expressed as a categorical distribution, $Q(s_t) = \mathrm{Cat}(s_t)$, and a Dirichlet distribution, $Q(A) = \mathrm{Dir}(a)$, respectively. Under this POMDP setup, variations of the variational free energy $F$ (Eq. 3) give appropriate posterior expectations about states and parameters.
Some optional parameters, depending on the specific decision-making scheme used, are:
• $C$: prior preferences over outcomes, $P(o \mid C)$. Here, $C$ is the preference for the predefined goal state. This parameter is generally used in the planning-based active inference models Sajid et al. [2021], Paul et al. [2021].
• $\Gamma(t)$: a time-specific risk parameter that the agent maintains to update the state-action mapping CL in the CL scheme, as in Isomura et al. [2022].
• $\beta(s, t)$: a state-dependent bias parameter used in the mixed model proposed in this paper.
These are used to parameterise the distribution of actions $u$, and actions are optimised through variational free energy minimisation. Further details are explained in the subsequent sections.
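For orientation, one standard way to write the quantities described above is sketched below. It is an assumption based on the textbook POMDP factorisation and the usual definition of variational free energy, restricted to the likelihood parameter $A$ for brevity; the paper's own numbered equations may include further parameters and priors.

```latex
\begin{align}
P(o_{1:T}, s_{1:T}, A \mid u_{1:T-1})
  &= P(A)\, P(s_1 \mid D)
     \prod_{t=2}^{T} P(s_t \mid s_{t-1}, u_{t-1}, B)
     \prod_{t=1}^{T} P(o_t \mid s_t, A), \\
Q(s_{1:T}, A)
  &= Q(A) \prod_{t=1}^{T} Q(s_t), \\
F
  &= \mathbb{E}_{Q}\!\left[ \ln Q(s_{1:T}, A)
     - \ln P(o_{1:T}, s_{1:T}, A \mid u_{1:T-1}) \right].
\end{align}
```

Written this way, $F$ equals the negative log evidence $-\ln P(o_{1:T} \mid u_{1:T-1})$ plus a non-negative Kullback-Leibler divergence between $Q$ and the exact posterior, which is why minimising $F$ bounds the surprise of the observations from above.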

Decision-making schemes in active inference
Decision-making under active inference is formulated as minimising the (expected) variational free energy of future time steps Kaplan and Friston [2018], Friston et al. [2009], Friston [2012]. This enables an agent to deploy a planning-based decision-making scheme, where an agent predicts possible outcomes and makes decisions to attain states and observations that minimise expected free energy (EFE). Classically, active inference optimises policies (i.e., sequences of actions in time), rather than a state-action mapping as in RL methods like Q-Learning Sutton and Barto [2018], and chooses the policy that minimises EFE Sajid et al. [2021]. However, such formulations limit agents to solving environments with low-dimensional state spaces Sajid et al. [2021], Paul et al. [2021].
Several improvements to the framework followed, including the recent sophisticated inference scheme Friston et al. [2021] that uses a recursive form of free energy to ease the computational complexity of policy search. The sophisticated inference method uses a forward tree search in time to evaluate EFE; however, it restricts the planning depth of agents Friston et al. [2021] due to computational complexity. More innovative algorithms like dynamic programming can be used to linearise the planning Paul et al. [2023], Da Costa et al. [2020]. The proposed linearised planning method was called dynamic programming in expected free energy (DPEFE) in Paul et al. [2023]. This DPEFE algorithm performs on par with benchmark reinforcement learning methods like Dyna-Q in environments similar to grid world tasks Paul et al. [2021] (see Section 2.3.1 for technical details of this method). A generalisation of the DPEFE algorithm was recently proposed as 'inductive inference' to model 'intentional behaviour' in agents Friston et al. [2023].
Another recent work deviates from this classical approach of predictive planning and employs "learning from experience" to determine optimal decisions Isomura et al. [2022]. This scheme is mathematically equivalent to a particular class of neural networks accompanied by some neuromodulations of synaptic plasticity Isomura and Friston [2020], Isomura et al. [2022]. It uses counterfactual learning (the CL method in this paper) to accumulate a measure of 'risk' over time based on environmental feedback. Subsequent work that validates this scheme experimentally using in-vitro neural networks has also appeared recently Isomura et al. [2023].
The following summarises the critical algorithmic details of both schemes: DPEFE in Sec. 2.3.1 and the CL scheme in Sec. 2.3.2. Both schemes are based on conventional POMDPs.

DPEFE scheme and action precision
The DPEFE scheme in this paper is based on the work in Paul et al. [2021]. This scheme was generalised to a POMDP setting in Paul et al. [2023]. The model parameters used are as given in Sec. 2.2. The action-perception loop in the DPEFE scheme comprises perception (i.e., identifying states that cause observations), planning, action selection, and learning model parameters. In this paper, all environments are fully observable since our focus is on decision-making rather than perception, hence $O = S$.
The action selection in the DPEFE scheme is implemented as follows. After evaluating the expected free energy (EFE, $G$) of future observations using dynamic programming (cf. Paul et al. [2023]), the agent evaluates the probability distribution for selecting an action $u$ as a softmax of the negative EFE scaled by the action precision: $P(u \mid s)_{DPEFE} = \sigma(-\alpha\, G(u \mid s))$ (Eq. 4). Here, $\sigma$ is the classical softmax function, so that actions with smaller EFE are selected with larger probabilities. The action precision parameter ($\alpha$) may be tuned to increase or decrease the agent's action selection confidence. For a detailed description of the evaluation of the EFE ($G$) and the DPEFE algorithm, refer to Paul et al. [2023] (Section 5).
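The effect of the action precision on this softmax can be seen in a few lines of code. The sketch below is illustrative only: the EFE values in `G` are made up, and `action_distribution` is a hypothetical helper, not a function from the authors' repository.

```python
import numpy as np


def action_distribution(G, alpha=1.0):
    """Softmax over negative expected free energy, scaled by the precision alpha."""
    logits = -alpha * np.asarray(G, dtype=float)
    logits -= logits.max()          # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()


# Hypothetical EFE values for three actions; the second action has the smallest EFE.
G = [2.0, 1.0, 3.0]
p_low = action_distribution(G, alpha=1.0)    # moderate confidence
p_high = action_distribution(G, alpha=5.0)   # sharper preference for the best action
```

Raising `alpha` concentrates probability on the action with the smallest EFE, while `alpha = 0` would give a uniform distribution regardless of `G`.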

CL method and risk parameter
Instead of attempting to minimise the EFE directly, in the counterfactual learning (CL) method, the agent learns a state-action mapping CL. This state-action mapping is learned through an update equation mediated by a 'risk' term $\Gamma_t$, as defined in Isomura et al. [2022]. In that update, $\langle \cdot \rangle$ refers to the average over time, and $\otimes$ is the Kronecker-product operator. Given the state-action mapping CL, the agent samples actions from the distribution $P(u \mid s)_{CL}$. In the simulations, $\Gamma_t$ is set as follows. When the agent is at the start position, or when the agent's action causes a "high risk", the value 0.9 is substituted, i.e., $\Gamma_t \leftarrow 0.9$. Otherwise, $\Gamma_t$ decreases continuously as a function of $T_{goal}$, the time at which the agent receives a positive environmental reward. So, the sooner the agent reaches the desirable state, the quicker $\Gamma_t$ (i.e., the risk) converges to zero.
All the update rules defined in the paper can be derived from the postulate that the agent tries to minimise the (variational) free energy (Eq. 3) with respect to the generative model Paul et al. [2023], Isomura et al. [2022]. In the rest of the paper, we investigate the performance of the two schemes, i.e. the DPEFE and the CL method, and consider a scheme combining them. The following section explores how these two schemes perform in a given environment.
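As a rough illustration of how a risk term can gate a state-action mapping, the sketch below assumes an update in which the outer product of one-hot action and state vectors (the matrix form of their Kronecker product) is added with weight $(1 - 2\Gamma_t)$, so that low risk strengthens the visited state-action pair and high risk weakens it. This weighting, the learning rate `eta`, the positivity floor and the column normalisation are assumptions made for the example and should be checked against Isomura et al. [2022]; all names are hypothetical.

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical mapping: C[u, s] scores action u in state s (uniform to begin with).
C = np.ones((n_actions, n_states))


def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v


def cl_update(C, s, u, gamma, eta=0.5):
    """Assumed risk-gated update: gamma < 0.5 reinforces (u, s), gamma > 0.5 suppresses it."""
    pair = np.outer(one_hot(u, C.shape[0]), one_hot(s, C.shape[1]))
    updated = C + eta * (1.0 - 2.0 * gamma) * pair
    return np.maximum(updated, 1e-6)   # keep entries positive so they can be normalised


def cl_policy(C, s):
    """Action distribution for state s, obtained by normalising the column of C."""
    col = C[:, s]
    return col / col.sum()


C = cl_update(C, s=0, u=1, gamma=0.1)   # low risk: action 1 in state 0 is reinforced
C = cl_update(C, s=2, u=0, gamma=0.9)   # high risk: action 0 in state 2 is suppressed
```

After these two updates the policy favours action 1 in state 0, disfavours action 0 in state 2, and remains uniform in the unvisited state 1.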

Results
We now test the performance of the two decision-making schemes (DPEFE and CL) in benchmark environments such as Cart Pole -v1 (Fig. 1) from OpenAI Gym.

Cart Pole -v1 (OpenAI Gym task)
In a Cart Pole -v1 environment, an agent is rewarded for balancing the pole upright (within an acceptable range) by moving the cart sideways (Fig. 1 (A)). An episode terminates when the pole or cart crosses the acceptable range (±12 degrees for the pole and ±2.4 units frame size for the cart, Fig. 1 (B)). This problem is inherently spontaneous: the agent must react to the current situation of the cart and the pole, without the need for planning from the controller.
We then test the active inference agents in a mutating setup, where the environment mutates to a more challenging version with half the acceptable range for both the pole and cart position (±6 degrees for the pole and ±1.2 units frame size for the cart). The performance of the active inference agents with different decision-making schemes is summarised in Fig. 2 (A).
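The termination criteria before and after the mutation can be written as a small check. The sketch below does not call the Gym API; the function names are hypothetical, and treating the thresholds as strict inequalities is an assumption.

```python
import math

POLE_LIMIT_DEG = 12.0   # acceptable pole angle before the mutation (degrees)
CART_LIMIT = 2.4        # acceptable cart position before the mutation (units)


def limits(mutated):
    """Return (pole_limit_deg, cart_limit); the mutated task halves both ranges."""
    scale = 0.5 if mutated else 1.0
    return POLE_LIMIT_DEG * scale, CART_LIMIT * scale


def episode_terminated(pole_angle_rad, cart_position, mutated=False):
    """True when the pole angle or the cart position leaves the acceptable range."""
    pole_limit_deg, cart_limit = limits(mutated)
    pole_out = abs(math.degrees(pole_angle_rad)) > pole_limit_deg
    cart_out = abs(cart_position) > cart_limit
    return pole_out or cart_out
```

For example, a pole angle of 8 degrees is acceptable in the original task but ends the episode in the mutated one, which is why the failure rate rises after the mutation.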
As expected, the CL method agent outperforms the other active inference schemes, as the problem demands spontaneous control, favouring a state-action mapping over planning. The agent quickly learns the necessary state-action mapping and balances the pole more effectively than the planning-based schemes. We also observe this after the mutation in the environment at episode number 100. The improved performance of the CL method agent after mutation warrants additional investigation; however, it can be attributed to the increased feedback frequency due to the increased failure rate after mutation.
In Fig. 2 (B), we see the evolution of the risk term ($\Gamma$). The risk $\Gamma$ settles to a value less than 0.5 as the agent learns more about the environment. It is interesting to note the increase in $\Gamma$ when faced with a mutation in the environment in Fig. 2 (B), as expected. The risk-reducing behaviour correlates with the increase in performance after episode number 100, highlighting the explainability of the active inference framework. Next, we test the agents in a fundamentally different environment, a maze task, which calls for planning for the future.

Figure 2: A: Performance of active inference agents with different decision-making schemes in the mutating Cart Pole -v1 (with a mutation at episode 100). After episode 100, the environment mutates to a harder version, which the agents must adapt to. B: Evolution of the risk parameter ($\Gamma_t$) of the CL method agent when embodied in the mutating Cart Pole problem. We can observe the spike at episode 100, consistent with the mutation, and the reduced risk resulting in improved performance in the second half of the trial.

Figure 3: A: A snapshot of the 900-state grid world (maze) environment. B: The optimal solution for the maze shown in A. This is a complex maze: when actions are taken randomly, it takes around 9000 steps to navigate the grid, compared with the optimal route of 47 steps.

Complex maze task and data-complexity trade-off
To compare the performance of the two agents in a strategic task, we simulate the performance in a standard grid world task as shown in Fig. 3 (A). The optimal solution to this grid problem is demonstrated in Fig. 3 (B). This is a complex grid world, which is non-trivial compared to the grid world tasks solved in the past literature Sajid et al. [2021], as it takes around nine thousand steps for an agent to reach the goal state if actions are taken randomly, compared with the optimal route of length 47.
The performance is evaluated in terms of how soon the agent can finish an episode (i.e., the length of an episode needed to reach the goal state; lower is better). The simulation results showing the performance of DPEFE and CL agents are plotted in Fig. 4 (A). These results show that the predictive planning-based DPEFE agent can learn quickly (i.e., within ten episodes) to navigate this grid. In the simulations, the action precision used by the DPEFE agent is $\alpha = 1$ in Eq. 4. The agent tends to navigate in even fewer time steps for a higher action precision ($\alpha$), always sticking to optimal actions. Additionally, we observe that the CL method agent takes longer to learn the optimal path. This result (Fig. 4 (A)) shows that the CL agent needs more experience in the environment (i.e. more data) to solve it.
In Fig. 4 (B), we compare the computational complexity associated with planning for decision-making in major active inference algorithms. The DPEFE algorithm is computationally efficient compared to other popular active inference schemes Sajid et al. [2021], Friston et al. [2021]. Please note that this figure also emphasises how the CL method has no computational complexity associated with planning. So, it is clear that the CL method agent is computationally cheaper than the DPEFE agent, as there is no planning component. The computational complexity of the DPEFE agent is associated with the planning depth (time horizon of planning, $T$), as seen in Fig. 4 (B). This demonstrates a data-complexity trade-off between the two schemes.
This realisation motivates us towards a mixed model, where we propose to develop an agent that can balance the two schemes according to the resources available to the agent. This makes sense from the neuro-biological perspective, as biological agents continually try to balance the resources spent on learning and planning for the future against the experience they already have. This idea also relates to the classic exploration-exploitation dilemma in reinforcement learning Triche et al. [2022].

Integrating the two decision-making approaches
To enable the agent to balance its ability to predict future outcomes and use prior experience, we introduce to the model a state-dependent bias parameter that evolves with experience, $\beta(s, t) \in [0, 1]$. This addition is motivated by the hypothesis that an agent maintains a sense of bias, quantifying its confidence in its past experience of deciding in that particular state.
When exposed to a new environment, an agent starts with an equal bias for the DPEFE (predictive planning) and CL schemes, represented by a prior bias parameter $\beta_{prior} = 0.5$.
Over the episodes, the agent will have the probability distributions for decision-making from both models. These distributions enable decision-making given the present state ($s$). In a fully observable environment (MDP), $s$ is known to the agent (i.e. $O = S$, or $A = I$, the identity mapping). In the partially observable case (POMDP), the agent infers the (hidden) state ($s$) from the observation ($o$) by minimising variational free energy Da Costa et al. [2020], Sajid et al. [2021].
Given the state estimation, $P(u \mid s)_{DPEFE}$ and $P(u \mid s)_{CL}$ are the distributions used for sampling decisions under the DPEFE scheme and the CL method, respectively (see Section 2.3.1 and Section 2.3.2 for details).
Given these distributions, the agent can now evaluate how 'useful' they are using their Shannon entropy ($H(X)$). This measure is beneficial as it represents how 'sure' that particular distribution is regarding a decision in that state ($s$). Namely, if the agent has confidence in a specific action, the action distribution tends to be a one-hot vector favouring the confident action; hence, the entropy of the distribution tends to zero, in contrast to the uniform distribution (not favouring any action) with maximum entropy. Thus, comparing this quantity enables the selection of the most confident strategy from the pool of different schemes.
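The two limiting cases described above, a one-hot distribution with zero entropy and a uniform distribution with maximum entropy, can be checked numerically. The helper below is a minimal sketch; the name `shannon_entropy` and the small constant `eps` used to avoid `log(0)` are choices made for the example.

```python
import numpy as np


def shannon_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a categorical distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))


confident = [1.0, 0.0, 0.0, 0.0]       # one-hot: the agent is sure of its action
undecided = [0.25, 0.25, 0.25, 0.25]   # uniform: no action is favoured

h_confident = shannon_entropy(confident)   # close to 0
h_undecided = shannon_entropy(undecided)   # close to ln(4), the maximum for four actions
```

Comparing the entropies of $P(u \mid s)_{DPEFE}$ and $P(u \mid s)_{CL}$ in the same state therefore indicates which scheme is currently the more decisive one.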
Based on this observation, over time, the agent can use this entropy measure to update the value of $\beta(s, t)$ (Eq. 8). In this update, $\alpha$ is a normalisation parameter stabilising the updated value, and we make sure that $\beta \in [0, 1]$ by re-calibrating $\beta < 0$ as $\beta = 0$ and $\beta > 1$ as $\beta = 1$. From a Bayesian inference perspective, one may view the updated belief $\beta$ in Eq. 8 as a posterior belief representing how likely the DPEFE model is to be selected, similar to Bayesian model selection schemes.
Using this measure of bias $\beta(s_t)$, the agent can now evaluate a new distribution for decision-making, $P_{MM}$ (Eq. 9), where MM stands for the mixed model. The flow diagram describing the proposed mixed model's POMDP-based "agent-environment" loop is given in Fig. 5.
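To make the role of the bias parameter concrete, the sketch below assumes two simple forms: an additive update that moves $\beta$ towards the scheme with the lower entropy, clipped to $[0, 1]$ as described above, and a convex combination of the two action distributions. Both forms, and the names `update_bias` and `mixed_policy`, are illustrative assumptions and are not claimed to reproduce Eqs. 8 and 9.

```python
import numpy as np


def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))


def update_bias(beta, p_dpefe, p_cl, alpha=0.1):
    """Assumed additive update: beta grows when DPEFE is the more confident scheme."""
    beta = beta + alpha * (entropy(p_cl) - entropy(p_dpefe))
    return float(np.clip(beta, 0.0, 1.0))   # re-calibrate to [0, 1]


def mixed_policy(beta, p_dpefe, p_cl):
    """Assumed convex combination: beta = 1 recovers DPEFE, beta = 0 recovers CL."""
    p = beta * np.asarray(p_dpefe, dtype=float) + (1.0 - beta) * np.asarray(p_cl, dtype=float)
    return p / p.sum()


p_dpefe = [0.9, 0.05, 0.05]   # hypothetical: planning is confident in this state
p_cl = [1/3, 1/3, 1/3]        # hypothetical: experience is still uninformative

beta = update_bias(0.5, p_dpefe, p_cl)     # starts from the prior bias of 0.5
p_mm = mixed_policy(beta, p_dpefe, p_cl)
```

Because the bias is state-dependent, an implementation would typically store one value of `beta` per state (for example in an array indexed by `s`) and update only the entry for the state just visited.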

Deriving update equations for the mixed model from variational free energy
Eqs. 8 and 9 can be derived from variational free energy minimisation under a POMDP generative model, starting from a variational free energy $F$ defined for the mixed model. When $\Gamma_t = 0$ and $\beta_{prior} = 0.5$, the derivative of $F$ with respect to the expectation of $\beta$, $E[\beta]$, gives the posterior expectation of $\beta$. Interestingly, this posterior expectation can be rewritten using the entropies of DPEFE and CL. The mixed-model $F$ reduces to the variational free energy (Eq. 3) for DPEFE or CL when $\beta = 1$ or $0$, respectively.
Thus, from the definition of the Shannon entropy, $\beta$ can be rewritten in terms of $H_{CL}$ and $H_{DPEFE}$ (Eq. 16). When $|H_{CL} - H_{DPEFE}| \ll 1$, Eq. 8 approximates Eq. 16. Minimisation of $F$ further yields Eq. 9, as it is an expression using the probability distribution and is equivalent to the posterior expectation. Therefore, the update rules for the mixed model (Eqs. 8 and 9) can be formally derived from variational free energy minimisation.

Performance of mixed-model in a mutating maze environment
We now examine the proposed mixed scheme with agents of different planning power (i.e. different planning depths, $N$) in a similar environment. The computational complexity of the DPEFE scheme depends linearly on the planning time horizon (planning depth), i.e. $T$, and the same holds for the mixed-model agent (see Fig. 4). Thus, an agent with planning depth $N = 50$ takes up twice the computational resources while planning compared to an agent with $N = 25$.
We use a mutating grid environment to test the performance of the mixed model-based agent. This mutating grid scheme is illustrated in Fig. 6. The agent starts in an easier version of the grid with an optimal path of four steps (Fig. 6 (A)). After 300 episodes, the environment mutates to the complex version of the grid shown in the previous section (see Fig. 6 (B)). This setup also enables us to study how adaptable the agent is to new environmental changes.
The performance is summarised in Fig. 7. We observed that all three mixed-model agents (with varying levels of planning ability) learned to navigate the easier grid within the first ten episodes (Fig. 7 (A)). However, when the environment mutated to the harder grid at episode number 300, the agents' learning resembled the performance observed when navigating that grid alone (Fig. 7 (B)), i.e., the complex grid with 900 states.

We also observed that the agent with higher planning ability learned to navigate the grid faster and more confidently than the other two. This result demonstrates that the proposed mixed model enables agents to balance the two decision-making approaches in the active inference framework.
The brains of biological organisms are also thought to employ mechanisms for switching between multiple strategies. Our model is potentially helpful for unveiling efficient decision-making mechanisms in the brain and their neuronal substrates, and for developing computationally efficient bio-mimetic agents.

Explainability of the active inference models
An additional advantage of the proposed mixed model (and of POMDP-based generative models) is that we can probe the model parameters to understand the basis of intelligent behaviour demonstrated by agents through the lens of active inference. Models that rely on artificial neural networks (ANNs) to scale up Fountas et al. [2020] have limited explainability regarding how agents make decisions, especially when faced with uncertainty.
In Fig. 8 (A), we can see the evolution of the risk ($\Gamma_t$) in the model (associated with the CL method as defined in Isomura et al. [2022]). We can observe that the model's risk quickly tends to zero when the easy grid is presented and solved; however, it shoots up when faced with the environment mutation.
Similarly, the evolution of the bias parameter (which balances the DPEFE and CL methods in the mixed model) is shown in Fig. 8 (B). Here, we also observe how the agent consistently maintains a higher bias towards the DPEFE model when it has a higher planning ability (i.e. the agent with a planning depth of $N = 50$ compared to the agents with $N = 25$ and $N = 5$).
We should note that the value of the bias parameter never rises above 0.5, even when the DPEFE agent is planning at $T = 50$. In the simulations, we start with a bias $\beta = 0.5$ and update $\beta$ according to Eq. 8. This shows how the agent eventually learns to rely on the mixed model's CL scheme (i.e., experience). Still, the DPEFE component (i.e. planning) accelerates learning and performance to aid decision-making. Such insights into the explainability of the agent's behaviour via model parameters help study the basis of natural and synthetic intelligence.

Conclusions
This paper compared and contrasted two decision-making schemes in the active inference framework. Observing the pros and cons of both approaches, we examined them on tasks that demand spontaneous (Cart Pole -v1) and strategic (maze) decision-making, and then tested a hybrid approach. The insights observed in this work will help improve algorithms used for control, given the excitement around using active inference schemes Da Costa et al. [2022].
We leave the detailed analysis of behavioural dependence on parameters and model expansion in more demanding environments to future work. Systematic comparison with models aided by artificial neural networks (ANNs), such as those in Fountas et al. [2020], is also a promising direction to pursue.

Software note
The grid environment and agents (DPEFE, CL and mixed-model schemes) were custom-written in Python. All scripts are available at the following link: https://github.com/aswinpaul/aimmppcl_2023.

Figure 1: A: A snapshot from the Cart Pole -v1 environment (from OpenAI Gym), B: Environment summary: The objective is to balance the pole (brown) upright as long as possible without meeting the episode termination criteria, i.e. without the pole and cart crossing pole-angle and cart-position thresholds respectively.

Figure 4: A: Performance comparison of DPEFE and CL agents in the 900-state grid scheme with 300 episodes. The DPEFE agent learns to navigate the grid faster (with a lower episode length) than the CL method agent. B: Comparison of computational complexity between state-of-the-art active inference algorithms Sajid et al. [2021], Friston et al. [2021], the DPEFE method Paul et al. [2021] and the CL method Isomura et al. [2022]. Please note that the y-axis is in the log scale. The computational complexity was calculated for the algorithms to implement planning in a standard grid like in Fig. 3.

Figure 5: Flow diagram of the agent-environment loop in the proposed mixed model combining planning and counterfactual learning. There is a key distinction between the generative process and the generative model in the active inference framework. In a POMDP, we assume that the observations are generated by the generative process ('Real world') from 'hidden states' ($s_t$) through a state-observation mapping ($A$), both being inaccessible to the agent. In the generative model, the agent uses $o_t$ to maintain an optimal belief about the hidden state $s_t$ ('Perception'). Subsequently, the agent uses the planning method (DPEFE) and the counterfactual learning (CL) method, combining their action distributions using the model-bias parameter $\beta$ for decision-making. The decision at time $t$ influences the hidden state of the 'Real world' at the next step, completing the loop. The generative process can be thought of as the environment the agent tries to survive in, whereas the generative model is completely part of the agent and can be interpreted as the 'imaginary' world in which the agent assumes it survives.

Figure 8: A: Evolution of the Risk parameter (Γ) of the mixed-model agent when embodied in the mutating grid scheme, B: Evolution of the model mixing parameter (β) of the mixed-model agent when embodied in the mutating grid scheme.