Article

Deep Hybrid Models: Infer and Plan in a Dynamic World

by Matteo Priorelli 1,2 and Ivilin Peev Stoianov 1,*
1 Institute of Cognitive Sciences and Technologies, National Research Council of Italy, 35137 Padova, Italy
2 DIAG, Sapienza University of Rome, 00185 Roma, Italy
* Author to whom correspondence should be addressed.
Entropy 2025, 27(6), 570; https://doi.org/10.3390/e27060570
Submission received: 9 April 2025 / Revised: 7 May 2025 / Accepted: 24 May 2025 / Published: 27 May 2025
(This article belongs to the Special Issue Active Inference in Cognitive Neuroscience)

Abstract

To determine an optimal plan for complex tasks, one often deals with dynamic and hierarchical relationships between several entities. Traditionally, such problems are tackled with optimal control, which relies on the optimization of cost functions; instead, a recent biologically motivated proposal casts planning and control as an inference process. Active inference assumes that action and perception are two complementary aspects of life whereby the role of the former is to fulfill the predictions inferred by the latter. Here, we present an active inference approach that exploits discrete and continuous processing, based on three features: the representation of potential body configurations in relation to the objects of interest; the use of hierarchical relationships that enable the agent to easily interpret and flexibly expand its body schema for tool use; the definition of potential trajectories related to the agent’s intentions, used to infer and plan with dynamic elements at different temporal scales. We evaluate this deep hybrid model on a habitual task: reaching a moving object after having picked a moving tool. We show that the model can tackle the presented task under different conditions. This study extends past work on planning as inference and advances an alternative direction to optimal control.

1. Introduction

Imagine a baseball player striking a ball with a bat. State-of-the-art approaches to simulating such goal-directed movements in a human-like manner often rely on optimal control [1,2]. This theory rests upon the formulation of goals in terms of value functions and their optimization via cost functions. While this approach has advanced both robotics and our understanding of human motor control, its biological plausibility remains disputed [3]. Moreover, value functions for motor control might restrict the range of motions an agent can learn [4]. In contrast, complex movements like handwriting or walking can emerge naturally from generative models encoding goals as prior beliefs about the environment [5]. The theory of active inference builds upon this premise, proposing that goal-directed behavior results from a biased internal representation of the world. This bias generates a cascade of prediction errors, forcing the agent to sample those observations that make its beliefs true [6,7,8,9]. In this view, the tradeoff between exploration and exploitation arises naturally, driving the agent to minimize the uncertainty of its internal model before maximizing potential rewards [10]. Active inference shares foundational principles with predictive coding [11,12] and the Bayesian brain hypothesis, which postulates that the brain makes sense of the world by constructing a hierarchical generative model that continuously makes perceptual hypotheses and refines them [13].
This perspective offers a promising avenue for advancing current robotics and machine learning, particularly in research that frames control and planning as an inference process [14,15,16,17,18]. A distinctive feature of active inference is its ability to model the environment as a hierarchy of causes and dynamic states evolving at different timescales [19], which is fundamental for biological phenomena such as linguistic communication [20], and for advanced movements such as that of a baseball player. In the latter, several characteristics of the nervous system can be well captured by an active inference agent, providing a robust alternative to optimal control. First, the human brain is assumed to maintain a hierarchical representation of the body that generates the motor commands required to achieve the desired goal [21]. This representation must be flexible, in that the baseball player should relate the configuration of the bat to his body, acting as an extension of his hand [22]. Tool-use experiments in non-human primates revealed that parietal and motor regions rapidly adapt to incorporate tools into the body schema, enabling seamless interaction with objects [23]. Further, the player’s brain should maintain a dynamic representation of the moving ball in order to predict its trajectory before and after the hit. The posterior parietal cortex is known to encode multiple objects in parallel during action sequences, forming visuomotor representations that also account for object affordances [24]. In addition, distinct neural populations in the dorsal premotor cortex are known to encode multiple reaching options in decision-making tasks, with one population being activated and the others suppressed as the decision unfolds [25]. In short, the human brain can maintain multiple potential body configurations and potential trajectories appropriate for specific tasks.
Despite its potential, the development of deep hierarchical models within active inference remains limited. Adaptation to complex data still relies primarily on neural networks as generative models [26,27,28,29,30,31,32,33,34,35]. One study demonstrated that a deep hierarchical agent, equipped with independent dynamics functions for each degree of freedom (DoF), could continuously adjust its internal trajectories to align with prior expectations, enabling advanced control of complex kinematic chains [36]. This ability to learn and act across intermediate timescales offers significant advantages for solving control tasks. Moreover, simulating real-world scenarios can benefit from so-called hybrid or mixed models in active inference, which integrate discrete decision making with continuous motion. However, state-of-the-art applications of such models have simulated static contexts only [8,37,38,39,40].
In this work, we address these challenges from a unified perspective. Our key contributions are summarized as follows:
  • We present an active inference agent affording robust planning in dynamic environments. The basic unit of this agent maintains potential trajectories, enabling a high-level discrete model to infer the state of the world and plan composite movements. The units are hierarchically combined to represent potential body configurations, incorporating object affordances (e.g., grasping a cup by the handle or with the whole hand) and their hierarchical relationships (e.g., how a tool can extend the agent’s kinematic chain).
  • We introduce a modular architecture designed for tasks that involve deep hierarchical modeling, such as tool use. Its multi-input and multi-output connectivity resembles traditional neural networks and represents an initial step toward designing deep structures in active inference capable of generalizing across and learning novel tasks.
  • We evaluate the agent’s performance in a common task: reaching a moving ball after reaching and picking a moving tool. The results highlight the interplay between the agent’s potential trajectories and the dynamic accumulation of sensory evidence. We demonstrate the agent’s ability to infer and plan under different conditions, such as random object positions and velocities.

2. Methods

2.1. Predictive Coding

According to predictive coding (PC), the human brain makes sense of the world by constructing an internal generative model of how hidden states of the environment generate the perceived sensations [11,12,41]. This internal model is continuously refined by minimizing the discrepancy (called prediction error) between sensations and the respective predictions. More formally, given some prior knowledge $p(x)$ over hidden states $x$ and partial evidence $p(y)$ over sensations $y$, the nervous system can find the posterior distribution of the hidden states given the sensations via Bayes' rule:
$$p(x|y) = \frac{p(x,y)}{p(y)}$$
However, direct computation of the posterior $p(x|y)$ is unfeasible since the evidence requires marginalizing over every possible outcome, i.e., $p(y) = \int p(x,y)\,dx$. Predictive coding supposes that, instead of performing exact Bayesian inference, organisms are engaged in a variational approach [42], i.e., they approximate the posterior distribution with a simpler recognition distribution $q(x) \approx p(x|y)$, and minimize the difference between the two distributions. This difference is expressed in terms of a Kullback–Leibler (KL) divergence:
$$D_{KL}[q(x)\,\|\,p(x|y)] = \int_x q(x) \ln \frac{q(x)}{p(x|y)}\,dx$$
Given that the denominator $p(x|y)$ still depends on the marginal $p(y)$, we express the KL divergence in terms of the log evidence and the Variational Free Energy (VFE), and minimize the latter quantity instead. The VFE is the negative of what in the machine learning community is known as the evidence lower bound or ELBO [43]:
$$F = \mathbb{E}_{q(x)}\left[\ln \frac{q(x)}{p(x,y)}\right] = \mathbb{E}_{q(x)}\left[\ln \frac{q(x)}{p(x|y)}\right] - \ln p(y)$$
Since the KL divergence is always nonnegative, the VFE provides an upper bound on surprise, i.e., $F \geq -\ln p(y)$. Therefore, minimizing $F$ is equivalent to minimizing the KL divergence with respect to $q(x)$. The more the VFE approaches 0, the closer the approximate distribution is to the real posterior, and the higher the model evidence (or, equivalently, the lower the surprise about sensory outcomes will be). The discrepancy between the two distributions also depends on the specific assumptions made over the recognition distribution; a common one is the Laplace approximation [42], which assumes Gaussian probability distributions, e.g., $q(x) = \mathcal{N}(\mu, \Sigma)$, where $\mu$ represents the most plausible hypothesis—also called belief—about the hidden states $x$, and $\Sigma$ is its covariance matrix. Now, we factorize the generative model $p(x,y)$ into likelihood and prior terms, and we further parameterize it with some parameters $\theta$:
$$\begin{aligned} p(x,y) &= p(y|x,\theta)\,p(x) \\ p(y|x,\theta) &= \mathcal{N}(g(x,\theta), \Sigma_y) \\ p(x) &= \mathcal{N}(\eta, \Sigma_\eta) \end{aligned}$$
where $g(x,\theta)$ is a likelihood function and $\eta$ is a prior. In this way, after applying the logarithm over the Gaussian terms, the VFE breaks down to the following simple formula:
$$F = \frac{1}{2}\left( \Pi_y \varepsilon_y^2 + \Pi_\eta \varepsilon_\eta^2 + \ln|2\pi\Sigma_y| + \ln|2\pi\Sigma_\eta| \right)$$
where we expressed the difference in the exponents of the Gaussian distributions in terms of prediction errors $\varepsilon_y = y - g(\mu, \theta)$ and $\varepsilon_\eta = \mu - \eta$, and we wrote the covariances in terms of their inverses, i.e., precisions $\Pi_y$ and $\Pi_\eta$. In order to minimize the KL divergence between the real and approximate posteriors, we can minimize the VFE with respect to the beliefs $\mu$—a process associated with perception—and the parameters $\theta$—generally referred to as learning. In practice, the update rules follow gradient descent:
$$\begin{aligned} \mu^* &= \arg\min_\mu F, & \dot{\mu} &= -\partial_\mu F = \partial_\mu g^T \Pi_y \varepsilon_y - \Pi_\eta \varepsilon_\eta \\ \theta^* &= \arg\min_\theta F, & \dot{\theta} &= -\partial_\theta F = \partial_\theta g^T \Pi_y \varepsilon_y \end{aligned}$$
In order to separate the timescales of fast perception and slow-varying learning, the two phases are treated as the steps of an EM algorithm, i.e., optimizing the beliefs while keeping the parameters fixed, and then optimizing the parameters while keeping the beliefs fixed.
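As a minimal sketch of this alternation, assuming a scalar linear likelihood $g(x, \theta) = \theta x$ (learning rates and iteration counts are illustrative choices, not taken from the paper):

```python
# Minimal sketch of the gradient-descent updates above for scalar variables.

def perceive(y, mu, theta, eta, pi_y=1.0, pi_eta=1.0, lr=0.1):
    """One gradient step on the belief mu (parameters fixed)."""
    eps_y = y - theta * mu            # sensory prediction error
    eps_eta = mu - eta                # prior prediction error
    return mu + lr * (theta * pi_y * eps_y - pi_eta * eps_eta)   # -dF/dmu

def learn(y, mu, theta, pi_y=1.0, lr=0.01):
    """One gradient step on the parameter theta (beliefs fixed)."""
    eps_y = y - theta * mu
    return theta + lr * (mu * pi_y * eps_y)                      # -dF/dtheta

mu, theta, eta, y = 0.0, 0.5, 0.0, 2.0
for _ in range(10):                   # EM-style alternation
    for _ in range(100):              # fast perception
        mu = perceive(y, mu, theta, eta)
    theta = learn(y, mu, theta)       # slow learning
```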
Predictive coding, originally rooted in data compression [44], has inspired biologically plausible alternatives to traditional deep learning, such as Predictive Coding Networks (PCNs) [45,46]. In fact, the predictive coding algorithm can be scaled up to learn highly complex structures, composed of causal relationships of arbitrarily high depth—in a similar way to deep neural networks. In particular, we can factorize a generative model into a product of different distributions, wherein a specific level only depends on the level above:
$$p(x^{(0)}, \ldots, x^{(L)}) = p(x^{(0)}) \prod_{l=1}^{L} p(x^{(l)}|x^{(l-1)}), \qquad p(x^{(l)}|x^{(l-1)}) = \mathcal{N}(g(x^{(l-1)}, \theta^{(l-1)}), \Sigma^{(l)})$$
where 0 indicates the highest level, and $L$ is the number of levels in the hierarchy. In this way, $x^{(l)}$ acts as an observation for level $l-1$, and $g(x^{(l)}, \theta^{(l)})$ acts as a prior for level $l+1$. As a result, this factorization allows us to express the update of beliefs and parameters of a specific level only based on the level above and the level below—with a simple VFE minimization as in the previous case. Differently from the backpropagation algorithm of neural networks, the message passing of prediction errors implements a biologically plausible Hebbian rule between presynaptic and postsynaptic activities. In fact, the predictions $u$ computed by the likelihood functions are typically a weighted combination of neurons passed to a nonlinear activation function $\phi$:
$$u_i^{(l)} = \sum_{j=1}^{J} W_{i,j}^{(l-1)} \phi(x_j^{(l-1)})$$
where $W_{i,j}^{(l-1)}$ are the weights from neuron $j$ at level $l-1$ to neuron $i$ at level $l$. Crucially, what is backpropagated in PCNs are not signals detecting increasingly complex features, but messages representing how much the model is surprised about sensory observations (e.g., a prediction equal to an observation means that the network structure is a good approximation of the real process, and no errors have to be conveyed).
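A minimal PCN sketch of this scheme for a three-level chain with a tanh nonlinearity, where only prediction errors are exchanged between levels (sizes, rates, and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.tanh
dphi = lambda x: 1.0 - np.tanh(x) ** 2

sizes = [4, 8, 6]                        # top level, middle level, observation
W = [rng.normal(0, 0.1, (sizes[l + 1], sizes[l])) for l in range(2)]
x = [rng.normal(0, 1.0, s) for s in sizes[:2]]   # beliefs at levels 0 and 1
y = rng.normal(0, 1.0, sizes[2])                 # sensory observation

for _ in range(200):                     # perception: relax beliefs on errors
    eps1 = x[1] - W[0] @ phi(x[0])       # prediction error at level 1
    epsy = y - W[1] @ phi(x[1])          # prediction error at the sensory level
    x[0] += 0.05 * dphi(x[0]) * (W[0].T @ eps1)
    x[1] += 0.05 * (-eps1 + dphi(x[1]) * (W[1].T @ epsy))

# learning: local Hebbian-like products of errors and presynaptic activities
eps1 = x[1] - W[0] @ phi(x[0])
epsy = y - W[1] @ phi(x[1])
W[0] += 0.01 * np.outer(eps1, phi(x[0]))
W[1] += 0.01 * np.outer(epsy, phi(x[1]))
```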

2.2. Hierarchical Active Inference

The theory of active inference builds on the same assumptions of predictive coding, but with two critical differences. The first one—which is actually shared with some implementations of predictive coding such as temporal PC [47]—assumes that living organisms constantly deal with highly dynamic environments, and the internal models that they build must reflect the changes occurring in the real generative process [48]. This relation is usually expressed in terms of generalized coordinates of motion (encoding, e.g., position, velocity, acceleration, and so on) [49]; consequently, the environment is modeled with the following nonlinear system:
$$\begin{aligned} \tilde{y} &= \tilde{g}(\tilde{x}) + w_y \\ D\tilde{x} &= \tilde{f}(\tilde{x}, \tilde{v}) + w_x \end{aligned}$$
where $\tilde{x}$ are the generalized hidden states, $\tilde{v}$ are the generalized hidden causes, $\tilde{y}$ are the generalized sensory signals, $D$ is a differential operator that shifts all the temporal orders by one, i.e., $D\tilde{x} = [x', x'', x''', \ldots]$, and the letter $w$ indicates (Gaussian) noise terms. The likelihood function $\tilde{g}$ defines how hidden states generate sensory observations (as in predictive coding), while the dynamics function $\tilde{f}$ specifies the evolution of the hidden states (see [8,50] for more details). The associated joint probability is factorized into independent distributions:
$$p(\tilde{y}, \tilde{x}, \tilde{v}) = p(\tilde{y}|\tilde{x})\, p(\tilde{x}|\tilde{v})\, p(\tilde{v})$$
where each distribution is Gaussian:
$$\begin{aligned} p(\tilde{y}|\tilde{x}) &= \mathcal{N}(\tilde{g}(\tilde{x}), \tilde{\Sigma}_y) \\ p(D\tilde{x}|\tilde{v}) &= \mathcal{N}(\tilde{f}(\tilde{x}, \tilde{v}), \tilde{\Sigma}_x) \\ p(\tilde{v}|\eta) &= \mathcal{N}(\eta, \tilde{\Sigma}_v) \end{aligned}$$
As in predictive coding, these distributions are inferred through approximate posteriors $q(\tilde{x})$ and $q(\tilde{v})$, minimizing the related VFE $F$. As a result, the updates of the beliefs $\tilde{\mu}$ and $\tilde{\nu}$, respectively, over the hidden states and hidden causes become:
$$\begin{aligned} \dot{\tilde{\mu}} - D\tilde{\mu} &= -\partial_\mu F = \partial_\mu \tilde{g}^T \tilde{\Pi}_y \tilde{\varepsilon}_y + \partial_\mu \tilde{f}^T \tilde{\Pi}_x \tilde{\varepsilon}_x - D^T \tilde{\Pi}_x \tilde{\varepsilon}_x \\ \dot{\tilde{\nu}} - D\tilde{\nu} &= -\partial_\nu F = \partial_\nu \tilde{f}^T \tilde{\Pi}_x \tilde{\varepsilon}_x - \tilde{\Pi}_v \tilde{\varepsilon}_v \end{aligned}$$
where Π ˜ y , Π ˜ x , and  Π ˜ v are the precisions, and  ε ˜ y , ε ˜ x , and  ε ˜ v are, respectively, the prediction errors of sensory signals, dynamics, and priors:
$$\begin{aligned} \tilde{\varepsilon}_y &= \tilde{y} - \tilde{g}(\tilde{\mu}) \\ \tilde{\varepsilon}_x &= D\tilde{\mu} - \tilde{f}(\tilde{\mu}, \tilde{\nu}) \\ \tilde{\varepsilon}_v &= \tilde{\nu} - \eta \end{aligned}$$
For a full account of free energy minimization in active inference, see [8]. Unlike the update rules of predictive coding, additional terms arise from the inferred dynamics, through which the model can capture the evolution of the environment. Also, note that since we are minimizing over (dynamic) paths and not (static) states, an additional term $D\tilde{\mu}$ is present; this implies trajectory tracking, during which the VFE is minimized only when the instantaneous change of the belief $\dot{\tilde{\mu}}$ matches the estimated trajectory $D\tilde{\mu}$ of the generalized hidden states.
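A minimal sketch of these updates for a single hidden state with two temporal orders, an identity likelihood, and a toy attractor dynamics $f(x) = \eta_v - x$ (precisions and step size are illustrative):

```python
# Sketch of generalized filtering with two temporal orders (mu0: position,
# mu1: velocity); the attractor dynamics and all constants are illustrative.

def gf_step(mu0, mu1, y, eta_v, pi_y=1.0, pi_x=1.0, dt=0.05):
    eps_y = y - mu0                   # sensory prediction error
    eps_x = mu1 - (eta_v - mu0)       # dynamics error: D mu - f(mu)
    dmu0 = mu1 + pi_y * eps_y - pi_x * eps_x   # D mu - dF/dmu (0th order)
    dmu1 = -pi_x * eps_x                       # -dF/dmu' (1st order)
    return mu0 + dt * dmu0, mu1 + dt * dmu1

mu0, mu1 = 0.0, 0.0
for _ in range(500):
    mu0, mu1 = gf_step(mu0, mu1, y=1.0, eta_v=1.0)   # beliefs track the target
```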
The second assumption made by active inference is that our brains not only perceive (and learn) the external generative process, but also interact with it to reach desired states (e.g., in order to survive, not only must one understand which cause leads to an increase or decrease in temperature, but also take actions to live in a narrow range around 37 degrees Celsius). The free energy principle, which is at the core of active inference, states that all living organisms act to minimize the free energy (or, equivalently, surprise). In fact, in addition to the perceptual inference of predictive coding, the VFE can be minimized by sampling those sensory observations that conform to some prior beliefs, e.g., $a = \arg\min_a F$, where $a$ are the motor commands. This process is typically called self-evidencing, implying that if I believe that I find myself in a narrow range of temperatures, minimizing surprise via action will lead to finding places with such temperatures—hence making my beliefs true. One role of the hidden causes is to define the agent’s priors that ensure survival. These priors generate proprioceptive predictions which are suppressed by motor neurons via classical reflex arcs [51]:
$$\dot{a} = -\partial_a F_p = -\partial_a \tilde{y}_p^T \tilde{\Pi}_p \tilde{\varepsilon}_p$$
where $\partial_a \tilde{y}_p$ is an inverse model from (proprioceptive) observations to actions, and $\tilde{\varepsilon}_p = \tilde{y}_p - \tilde{g}_p(\tilde{\mu})$ are the generalized proprioceptive prediction errors.
As in predictive coding, we can scale up this generative model to capture the causal relationships of several related entities (e.g., the joints of a kinematic structure), up to a certain depth [19,36,37,52]. Therefore, the prior becomes the prediction from the layer above, while the observation becomes the likelihood of the layer below:
$$\begin{aligned} \dot{\tilde{\mu}}^{(l)} &= D\tilde{\mu}^{(l)} + \partial_\mu \tilde{g}^{(l)T} \tilde{\Pi}_v^{(l)} \tilde{\varepsilon}_v^{(l)} + \partial_\mu \tilde{f}^{(l)T} \tilde{\Pi}_x^{(l)} \tilde{\varepsilon}_x^{(l)} - D^T \tilde{\Pi}_x^{(l)} \tilde{\varepsilon}_x^{(l)} \\ \dot{\tilde{\nu}}^{(l)} &= D\tilde{\nu}^{(l)} + \partial_\nu \tilde{f}^{(l)T} \tilde{\Pi}_x^{(l)} \tilde{\varepsilon}_x^{(l)} - \tilde{\Pi}_v^{(l-1)} \tilde{\varepsilon}_v^{(l-1)} \end{aligned}$$
where:
$$\begin{aligned} \tilde{\varepsilon}_x^{(l)} &= D\tilde{\mu}_x^{(l)} - \tilde{f}^{(l)}(\tilde{\mu}^{(l)}, \tilde{\nu}^{(l)}) \\ \tilde{\varepsilon}_v^{(l)} &= \tilde{\mu}_v^{(l+1)} - \tilde{g}^{(l)}(\tilde{\mu}^{(l)}) \end{aligned}$$
and the superscript indicates the index of the level, as before. Here, the role of the hidden causes is to link hierarchical levels. This allows the agent to construct a hierarchy of representations varying at different temporal scales, critical for realizing richly structured behaviors such as linguistic communication [20] or singing [53]. In this case, the motor commands minimize the generalized prediction errors of the lowest level of the hierarchy.
This continuous formulation of active inference is highly effective for dealing with realistic environments. However, minimizing the VFE (which is the free energy of the past and present) does not afford planning and decision making in the immediate future. To do this, we construct a discrete generative model in which we discretize the possible future states and encode their expectations with categorical distributions. We then minimize the Expected Free Energy (EFE) which, as the name suggests, is the free energy that the agents expect to perceive in the future [8,54]. This discrete generative model is similar to the continuous model defined above, with the difference that we condition the hidden states over policies π (which in active inference are sequences of discrete actions):
$$p(s, o, \pi) = p(o|s, \pi)\, p(s|\pi)\, p(\pi)$$
Here, s and o are discrete states and outcomes, which represent past, present, and future states as in Hidden Markov Models. Then, the EFE is specifically constructed by considering future states as random variables that need to be inferred:
$$G_\pi = \mathbb{E}_{q(s,o|\pi)}\left[\ln \frac{q(s|\pi)}{p(s,o|\pi)}\right] \approx -\mathbb{E}_{q(s,o|\pi)}\left[\ln \frac{q(s|o,\pi)}{q(s|\pi)}\right] - \mathbb{E}_{q(o|\pi)}\left[\ln p(o|C)\right]$$
where $q(o|\pi)$ is a recognition distribution that the agent constructs to infer the real posterior of the generative process. Critically, the probability distribution $p(o|C)$ encodes preferred outcomes, acting similarly to a prior over the hidden causes in the continuous counterpart. The last two terms are, respectively, called epistemic (uncertainty reducing) and pragmatic (goal seeking). In practice, this quantity is used by first factorizing the agent’s generative model as in POMDPs:
$$p(s_{1:T}, o_{1:T}, \pi) = p(s_1) \cdot p(\pi) \cdot \prod_{\tau=1}^{T} p(o_\tau|s_\tau) \cdot \prod_{\tau=2}^{T} p(s_\tau|s_{\tau-1}, \pi)$$
Each of these elements can be represented with categorical distributions:
$$\begin{aligned} p(s_1) &= Cat(D) & p(\pi) &= Cat(E) \\ p(o_\tau|s_\tau) &= Cat(A) & p(s_\tau|s_{\tau-1}, \pi) &= Cat(B_{\pi,\tau}) \end{aligned}$$
where $D$ encodes beliefs about the initial state, $E$ encodes the prior over policies, $A$ is the likelihood matrix, and $B_{\pi,\tau}$ is the transition matrix. The minimization of EFE in discrete models follows the variational method used by predictive coding and continuous-time active inference; specifically, the use of categorical distributions breaks down the inference of hidden states and policies to a simple local message passing:
$$\begin{aligned} s_{\pi,\tau} &= \sigma\left(\ln B_{\pi,\tau-1} s_{\pi,\tau-1} + \ln B_{\pi,\tau}^T s_{\pi,\tau+1} + \ln A^T o_\tau\right) \\ \pi &= \sigma(\ln E - G) \\ G_\pi &\approx \sum_\tau A s_{\pi,\tau} \cdot \left(\ln A s_{\pi,\tau} - \ln p(o_\tau|C)\right) - s_{\pi,\tau} \cdot diag(A^T \ln A) \end{aligned}$$
For a complete treatment of how the EFE and the approximate posteriors are computed—along with other insightful implications of the free energy principle in discrete models—see [8], while [55] provides a more practical tutorial with basic applications. Equation (21) shows that the update rule for the discrete hidden states at time $\tau$ conditioned over a policy $\pi$ is a combination of messages coming from the previous and next discrete time steps, and a message coming from the discrete outcome. This combination is passed through a softmax function in order to obtain a proper probability. In addition, the optimal policy $\pi$ is found by a combination of a policy prior and the EFE, where the latter is a composition of epistemic and pragmatic behaviors. Policy inference can be refined by computing the EFE several steps ahead in the future—a process called sophisticated inference. Then, at each discrete step $\tau$, the agent selects the most likely action $u$ under all policies, i.e., $u_t = \arg\max_u \sum_\pi \pi \cdot [U_{\pi,t} = u]$.
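A sketch of these updates for a toy two-state, two-policy model, evaluating a single future step; the matrices and preferences are illustrative, and the message from the next time step is omitted:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])             # likelihood p(o|s)
B = [np.eye(2),                        # policy 0: stay
     np.array([[0.0, 1.0],
               [1.0, 0.0]])]           # policy 1: switch
C = np.log(np.array([0.2, 0.8]))       # log-preferences over outcomes
E = np.log(np.ones(2) / 2)             # flat policy prior
o = np.array([1.0, 0.0])               # current (one-hot) observation
s_prev = np.array([0.5, 0.5])

G = np.zeros(2)
for pi, Bp in enumerate(B):
    s = softmax(np.log(Bp @ s_prev + 1e-16) + np.log(A.T @ o + 1e-16))
    o_pred = A @ s                     # predicted outcome under this policy
    risk = o_pred @ (np.log(o_pred + 1e-16) - C)          # pragmatic term
    ambiguity = -s @ np.diag(A.T @ np.log(A + 1e-16))     # epistemic term
    G[pi] = risk + ambiguity

policy = softmax(E - G)                # posterior over policies
action = int(np.argmax(policy))
```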
As before, we can construct a hierarchical structure able to express more and more invariant representations of hidden states [56]. In discrete state-space, links between hierarchical levels are usually made between hidden states via the matrix D —although in some formulations, the hidden states of a level condition the policies of the subordinate levels. The update rules for this hierarchical alternative become:
$$\begin{aligned} s_{\pi,\tau}^{(l)} &= \sigma\left(\ln B_{\pi,\tau-1}^{(l)} s_{\pi,\tau-1}^{(l)} + \ln B_{\pi,\tau}^{(l)T} s_{\pi,\tau+1}^{(l)} + \ln A^{(l)T} o_\tau^{(l)} + \ln D^{(l+1)T} s_1^{(l+1)}\right) \\ \pi^{(l)} &= \sigma(\ln E^{(l)} - G^{(l)}) \\ G_\pi^{(l)} &\approx \sum_\tau A^{(l)} s_{\pi,\tau}^{(l)} \cdot \left(\ln A^{(l)} s_{\pi,\tau}^{(l)} - \ln p(o_\tau^{(l)}|C)\right) - s_{\pi,\tau}^{(l)} \cdot diag(A^{(l)T} \ln A^{(l)}) \end{aligned}$$
This implies that each level takes abstract actions to minimize its EFE, only depending on the levels immediately below and above.

2.3. Bayesian Model Comparison

Bayesian model comparison is a technique used to compare a posterior over some data with a few simple hypotheses known a priori [57]. Consider a generative model $p(\theta, y)$ with parameters $\theta$ and data $y$:
$$p(\theta, y) = p(y|\theta)\, p(\theta)$$
We introduce additional distributions $p(y, \theta|m)$ which are reduced versions of the first model if the likelihood of some data is the same under both models, i.e., $p(y|\theta, m) = p(y|\theta)$, and the only difference rests upon the specification of the priors $p(\theta|m)$. We can express the posterior of the reduced models in terms of the posterior of the full model and the ratios of the priors and the evidence:
$$p(\theta|y, m) = p(\theta|y)\, \frac{p(\theta|m)}{p(\theta)}\, \frac{p(y)}{p(y|m)}$$
The procedure for computing the reduced posteriors is the following. First, we integrate over the parameters to obtain the evidence ratio of the two models:
$$p(y|m) = p(y) \int p(\theta|y)\, \frac{p(\theta|m)}{p(\theta)}\, d\theta$$
Then, we define an approximate posterior $q(\theta)$ and we compute the reduced free energies of each model $m$:
$$F[p(\theta|m)] = F[p(\theta)] + \ln \mathbb{E}_q\left[\frac{p(\theta|m)}{p(\theta)}\right]$$
This VFE indicates how well the reduced representation explains the full model. Similarly, the approximate posterior of the reduced model can be written in terms of the posterior of the full model:
$$\ln q(\theta|m) = \ln q(\theta) + \ln \frac{p(\theta|m)}{p(\theta)} - \ln \mathbb{E}_q\left[\frac{p(\theta|m)}{p(\theta)}\right]$$
The Laplace approximation [58] leads to a simple form of the approximate posterior and the reduced free energy. Assuming the following Gaussian distributions:
$$\begin{aligned} p(\theta) &= \mathcal{N}(\eta, \Sigma) & p(\theta|m) &= \mathcal{N}(\eta_m, \Sigma_m) \\ q(\theta) &= \mathcal{N}(\mu, C) & q(\theta|m) &= \mathcal{N}(\mu_m, C_m) \end{aligned}$$
the reduced free energy turns to:
$$F[p(\theta|m)] = F[p(\theta)] + \frac{1}{2} \ln |\Pi_m P C_m \Sigma| - \frac{1}{2}\left( \mu^T P \mu - \mu_m^T P_m \mu_m - \eta^T \Pi \eta + \eta_m^T \Pi_m \eta_m \right)$$
expressed in terms of the precisions of the priors $\Pi$ and posteriors $P$. The reduced posterior mean and precision are computed via Equation (27):
$$\mu_m = C_m (P\mu - \Pi\eta + \Pi_m \eta_m), \qquad P_m = P - \Pi + \Pi_m$$
In this way, two different hypotheses $i$ and $j$ can be easily compared to infer which is the most likely to have generated the observed data; this comparison has the form of a log-Bayes factor $F[p(\theta|i)] - F[p(\theta|j)]$. For a more detailed treatment of Bayesian model comparison (in particular under the Laplace approximation), see [57,59].
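For scalar Gaussians, the reduced posterior and the change in free energy above have a simple closed form; a minimal sketch (all numbers illustrative):

```python
import numpy as np

def reduced_free_energy(mu, P, eta, Pi, eta_m, Pi_m):
    """Change in free energy F[p(theta|m)] - F[p(theta)] for a reduced prior."""
    P_m = P - Pi + Pi_m                              # reduced posterior precision
    mu_m = (P * mu - Pi * eta + Pi_m * eta_m) / P_m  # reduced posterior mean
    dF = 0.5 * np.log(Pi_m * P / (P_m * Pi)) \
         - 0.5 * (mu * P * mu - mu_m * P_m * mu_m
                  - eta * Pi * eta + eta_m * Pi_m * eta_m)
    return dF, mu_m, P_m

# Compare two hypotheses (reduced priors centred on different values)
mu, P = 0.8, 4.0          # full posterior (obtained by inference on the data)
eta, Pi = 0.0, 1.0        # full prior
dF1, _, _ = reduced_free_energy(mu, P, eta, Pi, eta_m=1.0, Pi_m=4.0)
dF2, _, _ = reduced_free_energy(mu, P, eta, Pi, eta_m=-1.0, Pi_m=4.0)
log_bayes_factor = dF1 - dF2   # compares the evidence for the two hypotheses
```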

3. Results

3.1. Deep Hybrid Models

Analyzing the emergence of distributed intelligence, Friston et al. [60] emphasized three kinds of depth within the framework of active inference: factorial, hierarchical, and temporal. Factorial depth assumes independent factors in the agent’s generative model (e.g., objects and qualities of an environment, or more abstract states), which can be combined to generate outcomes and transitions. Hierarchical depth introduces causal relationships between levels, inducing a separation of temporal scales whereby higher levels happen to construct more invariant representations, while lower levels better capture the rapid changes of sensory stimuli. Temporal depth entails, in discrete terms, a vision into the imminent future that can be used for decision making, or, in continuous terms, increasingly precise estimates of dynamic trajectories.
In the following, we present the main features of a deep hybrid model in terms of factorial, temporal, and hierarchical depths in the context of flexible behavior, iterative transformations of reference frames, and dynamic planning. By deep hybrid model, we mean an active inference model composed of hybrid units connected hierarchically. Here, hybrid means that discrete and continuous representations are encoded within each unit, wherein the communication between the two domains is achieved by Bayesian model reduction [57,59]. As a technical note, all the internal operations can be computed through automatic differentiation, i.e., by maintaining the gradient graph when performing each forward pass and propagating back the prediction errors. For a detailed treatment of predictive coding, hierarchical active inference in discrete and continuous state-spaces, and Bayesian model comparison, see Section 2.1, Section 2.2 and Section 2.3, respectively.

3.1.1. Factorial Depth and Flexible Behavior

Consider the case where one agent needs to infer which object is being followed by another agent. This can be achieved through a hybrid unit $\mathcal{U}$, whose factor graph is depicted in Figure 1a. The variables are continuous hidden states $\tilde{x} = [x, x']$, observations $\tilde{y} = [y, y']$, and (discrete) hidden causes $v$. Hidden states and observations comprise two temporal orders (e.g., $x$ encodes the position and $x'$ the velocity), and the first temporal order will be indicated as the 0th order; see the Methods section for more information. The factors are dynamics functions $f$ and likelihood functions $\tilde{g} = [g, g']$. Note also a prior over the 0th-order hidden states $\eta_x$, and a prior over the hidden causes $H_v$ encoding the agent’s goals. The generative model is the following:
$$p(\tilde{x}, v, \tilde{y}) = p(\tilde{y}|\tilde{x})\, p(x'|x, v)\, p(x)\, p(v)$$
We assume that hidden causes and hidden states have different dimensions, resulting in two factorizations. The hidden states, with dimension N, are sampled from independent Gaussian distributions and generate predictions in parallel pathways:
$$p(x) = \prod_n^N \mathcal{N}(\eta_{x,n}, \Sigma_{\eta,x,n}), \qquad p(\tilde{y}|\tilde{x}) = \prod_n^N \mathcal{N}(\tilde{g}_n(\tilde{x}_n), \tilde{\Sigma}_{y,n})$$
where $\Sigma_{\eta,x,n}$ and $\tilde{\Sigma}_{y,n}$ are their covariance matrices. In turn, the hidden causes, with dimension $M$, are sampled from a categorical distribution:
$$p(v) = Cat(H_v)$$
This differs from state-of-the-art hybrid architectures, which assume separate continuous and discrete models with continuous hidden causes (see Section 2.2) [48]. A discrete hidden cause $v_m$ concurs, with the hidden states $x$, in generating a specific prediction for the 1st temporal order $x'$:
$$p(x'|x, m) = \mathcal{N}(f_m(x), \Sigma_{x,m})$$
This probability distribution entails a potential trajectory, or hypothetical evolution of the hidden states, which the agent maintains to infer the state of affairs of the world and act. More formally, we consider $p(x'|x, m)$ as being the $m$th reduced version of a full model:
$$p(x'|x, v) = \mathcal{N}(\eta_{x'}, \Sigma_{x'})$$
This allows us—using the variational approach for approximating the true posterior distributions—to convert discrete signals into continuous signals and vice versa through Bayesian model average and Bayesian model comparison, respectively (see Section 2.3 and [57,59]). In particular, top-down messages combine the potential trajectories $f_m(x)$ with the related probabilities encoded in the discrete hidden causes $v$:
$$\eta_{x'} = \sum_m^M v_m f_m(x)$$
This computes a dynamic path that is an average of the agent’s hypotheses, based on its prior $H_v$. Conversely, bottom-up messages compare the agent’s prior surprise $\ln H_v$ with the log evidence $l_m$ of every reduced model, i.e., $v = \sigma(\ln H_v + l)$, where $l = [l_1, \ldots, l_M]$ and $\sigma$ is a softmax function. The log evidence is accumulated over a continuous time $T$:
$$l_m = \int_0^T \frac{1}{2}\left( \mu_m^T P_{x,m} \mu_m - f_m(x)^T \Pi_{x,m} f_m(x) - \mu^T P_x \mu + \eta_x^T \Pi_x \eta_x \right) dt$$
where $\mu_m$, $P_{x,m}$, and $\Pi_{x,m}$ are the mean, posterior precision, and prior precision of the $m$th reduced model. In this way, the agent can infer which dynamic hypothesis is most likely to have generated the perceived trajectory; see Figure 1b and [61] for more details about this approach.
A hybrid unit has useful features deriving from the factorial depths of hidden states and causes. Consider the case where the hidden states encode the agent’s configuration and other environmental objects, while the hidden causes represent the agent’s intentions. A hybrid unit could dynamically assign the causes of its actions at a particular moment; this is critical, e.g., in a pick-and-place operation, during which an object is first the cause of the hand movements—resulting in a picking action—but then it is the consequence of another cause (i.e., a goal position)—resulting in a placing action. This approach differs from other solutions [6,62] that directly encode a target location in the hidden causes. Further, embedding environmental entities—and not just the self—into the hidden states permits inferring their dynamic trajectories, which is fundamental for interactions, e.g., in catching objects on the fly [63] or in tracking a hidden target with the eyes [64]. Considering the example in Figure 1b, the hidden states $x$ and $x'$ may encode the angle and angular velocity of an agent’s arm, and two hidden causes $v_{circle}$ and $v_{square}$ may be associated with dynamics functions $f_{circle}$ and $f_{square}$ encoding (potential) reaching movements toward the two objects. The environment may be inferred by two likelihood functions, one—$g_p$—predicting proprioceptive observations (i.e., the joint angle and angular velocity), and another one—$g_v$—predicting visual observations (i.e., the hand position and velocity). Since we are interested in inferring which object is being followed by the agent, we set a uniform discrete prior $H_v$. Then, Equation (37) compares the actual agent’s trajectory to the two reaching movements toward the two objects, and assigns a higher probability to the one better resembling the real trajectory.
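As a toy illustration of this mechanism, consider a one-dimensional "arm" tracking the circle. Top-down, the hidden causes weight the potential trajectories into an average prior; bottom-up, each hypothesis accumulates log evidence according to how well its predicted velocity explains the inferred one. The squared-error accumulation below is a deliberately simplified stand-in for the full reduced-free-energy expression, and all rates and values are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

targets = np.array([1.0, -1.0])      # circle and square positions (1-D world)
H = np.log(np.ones(2) / 2)           # uniform prior over the two hidden causes
logev = np.zeros(2)                  # accumulated log evidence l = [l_1, l_2]
mu, mu1 = 0.0, 0.0                   # beliefs over position and velocity

for _ in range(300):
    y1 = 0.5 * (targets[0] - mu)     # observed velocity: the agent tracks the circle
    mu1 += 0.2 * (y1 - mu1)          # crude inference of the 1st order
    mu += 0.05 * mu1                 # integrate the inferred velocity
    f = 0.5 * (targets - mu)         # potential trajectories f_m(x), one per cause
    v = softmax(H + logev)           # posterior over the discrete hidden causes
    eta1 = v @ f                     # Bayesian model average (top-down prior)
    logev += -0.5 * (f - mu1) ** 2   # simplified evidence accumulation

print(softmax(H + logev))            # mass concentrates on the "circle" cause
```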

3.1.2. Hierarchical Depth and Iterative Transformations

Hierarchical depth is critical in many tasks that require learning of modular and flexible functions. Considering motor control, forward kinematics is repeated throughout every element of the kinematic chain, computing the end effector position from the body-centered reference frame. Iterative transformations are also fundamental in computer vision, where camera models perform roto-translations and perspective projections in sequence. How can we express such hierarchical computations in terms of inference? We design a structure called Intrinsic-Extrinsic (or IE) module, performing iterative transformations between reference frames [36,65]. A unit $\mathcal{U}_e^{(i)}$—where the superscript indicates the $i$th level—encodes a signal $x_e^{(i)}$ in an extrinsic reference frame (e.g., Cartesian coordinates), while another unit $\mathcal{U}_i^{(i)}$ represents an intrinsic signal $x_i^{(i)}$ (e.g., polar coordinates). At each level $i$, a likelihood function $g_e^{(i)}$ applies a transformation to the extrinsic signal provided by the higher level based on the intrinsic information, and returns a new extrinsic state:
$$x_e^{(i)} = g_e^{(i)}(x_i^{(i)}, x_e^{(i-1)}) + w_e = T^{(i)}(x_i^{(i)}) \cdot x_e^{(i-1)} + w_e$$
where $w_e$ is a noise term and $T^{(i)}$ is a linear transformation matrix. This new state acts as a prior for the subordinate levels in a multiple-output system; indicating with the superscript $(i,j)$ the $i$th hierarchical level and the $j$th unit within the same level, we link the IE modules in the following way:
$$y_e^{(i,j)} \equiv x_e^{(i+1,j)}, \qquad x_e^{(i-1,j)} \equiv \eta_e^{(i,j)}$$
as displayed in Figure 2; hence, the observation of level $i$ becomes the prior over the hidden states of level $i+1$. Ill-posed problems that generally have multiple solutions—such as inverse kinematics or depth estimation—can be solved by inverting the agent’s generative model and backpropagating the sensory prediction errors, with two additional features compared to traditional methods: (i) the possibility of steering the optimization by imposing appropriate priors, e.g., for avoiding singularities during inverse kinematics; (ii) the possibility of acting over the environment to minimize uncertainty, e.g., with motion parallax during depth estimation.
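A sketch of this forward pass for a planar chain, where each level roto-translates the pose of the level above by its own joint angle and limb length (angles and lengths are illustrative):

```python
import numpy as np

def T(theta, length, parent_pose):
    """Map a parent pose [px, py, phi] through one joint (angle, length)."""
    px, py, phi = parent_pose
    return np.array([px + length * np.cos(phi + theta),
                     py + length * np.sin(phi + theta),
                     phi + theta])

angles = [0.3, -0.5, 0.8]             # intrinsic states, one per level
lengths = [1.0, 0.8, 0.6]
pose = np.zeros(3)                    # body-centred root pose
for theta, l in zip(angles, lengths):
    pose = T(theta, l, pose)          # prediction becomes the prior below
print(pose)                           # end effector position and orientation
```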
Encoding signals in intrinsic and extrinsic reference frames also induces a decomposition over proprioceptive and exteroceptive predictions, as well as intrinsic and extrinsic dynamics functions, leading to simpler (and yet richer) attractor states. Figure 2b shows two examples of goal-directed behavior with complex kinematic structures—a 28-DoF kinematic tree and a 23-DoF human body. In these cases, an IE module of Figure 2a is employed for each DoF of the agent. Each IE module encodes the position and velocity of a specific limb, both in extrinsic (e.g., Cartesian)—$x_e^{(i)}$ and $x_e'^{(i)}$—and intrinsic (e.g., polar)—$x_i^{(i)}$ and $x_i'^{(i)}$—reference frames. These modules are connected hierarchically, i.e., the trunk position generates predictions for the positions of both arms and legs through the likelihood function $g_e^{(i)}$ computing forward kinematics. The goal-directed behavior of the kinematic tree is realized by defining four reaching dynamics functions (toward the red objects) at the last levels of the hierarchy representing the end effectors. Instead, the behavior of the human body is achieved by defining repulsive dynamics functions for the extrinsic reference frames of each IE module. The decomposition of independent dynamics is also useful for tasks requiring multiple constraints in both domains, e.g., when walking with a glass in hand [36], and it has also been applied to controlling robots in 3D environments [66], or for estimating the depth of an object by moving the eyes [65]. Further, the factorial depth previously described permits representing hierarchically not only the self, but also the objects in relation to the self, along with the kinematic chains of other agents [67]. In this way, an agent could maintain a potential body configuration whenever it observes a relevant entity. This representation also accounts for the affordances of objects to be manipulated, and can be realized efficiently as soon as necessary.

3.1.3. Temporal Depth and Dynamic Planning

Consider the following discrete generative model:
$$p(s_{1:\tau}, o_{1:\tau}, \pi) = p(s_1)\, p(\pi) \prod_\tau p(o_\tau|s_\tau)\, p(s_\tau|s_{\tau-1}, \pi)$$
where:
$$\begin{aligned} p(s_1) &= Cat(D) & p(\pi) &= \sigma(-G) \\ p(o_\tau|s_\tau) &= Cat(A) & p(s_{\tau+1}|s_\tau, \pi) &= Cat(B_{\pi,\tau}) \end{aligned}$$
Here, $A$, $B$, and $D$ are the likelihood matrix, transition matrix, and prior, $\pi$ is the policy, $s_\tau$ are the discrete hidden states at time $\tau$, $o_\tau$ are discrete observations, and $G$ is the expected free energy (see Section 2.2 for more details).
We can let the likelihood p ( o τ | s τ ) directly bias the (discrete) hidden causes of a hybrid unit:
$$H_\tau \equiv A s_\tau, \qquad o_\tau \equiv v_\tau$$
Hence, the prior $H_\tau$ over the discrete hidden causes becomes the prediction by the discrete model, while the discrete observation becomes the discrete hidden causes of the hybrid unit. This is an alternative method to the state-of-the-art, which considers an additional level between discrete observations and static priors over continuous hidden causes [37]. Here, the discrete model can impose priors over trajectories even in the same period $\tau$, thus affording dynamic planning [61,67]. If a discrete model is linked to different hybrid units in parallel—as shown in Figure 3—the discrete hidden states are inferred by combining multiple pieces of evidence:
$$s_{\pi,\tau} = \sigma\left(\ln B_{\pi,\tau-1} s_{\pi,\tau-1} + \ln B_{\pi,\tau+1}^T s_{\pi,\tau+1} + \sum_i \ln A^{(i)T} v_\tau^{(i)}\right)$$
where $s_{\pi,\tau}$ are the discrete hidden states conditioned over policy $\pi$ at time $\tau$, while the superscript $i$ indicates the $i$th hybrid unit. These parallel pathways synchronize the behavior of all low-level units based on the same high-level plan, allowing, e.g., simultaneous coordination of every limb of the human body.
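A sketch of this combination for a three-state task and two hybrid units with identity likelihood matrices; the forward message from $\tau+1$ is omitted for brevity, and all contents are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

B_prev = np.eye(3)                       # transition from tau-1 under the policy
s_prev = np.array([0.8, 0.1, 0.1])       # previous discrete hidden states
A = [np.eye(3), np.eye(3)]               # identity likelihoods, one per unit
v = [np.array([0.2, 0.7, 0.1]),          # hidden causes inferred by unit 1
     np.array([0.3, 0.6, 0.1])]          # hidden causes inferred by unit 2

logs = np.log(B_prev @ s_prev + 1e-16)   # message from the previous time step
for A_i, v_i in zip(A, v):               # evidence from each parallel pathway
    logs += np.log(A_i.T @ v_i + 1e-16)
s_tau = softmax(logs)                    # synchronized high-level state
```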
In Figure 3a, we notice the two kinds of temporal depths, peculiar to hybrid active inference. The first one comes from the discrete component and unfolds over future states $(s_1, s_2, s_3, \ldots)$ over a time horizon defined by the policy length; it allows the agent to make plans by computing the expected free energy of those states [68]. The second temporal depth derives from the continuous level and unfolds over the temporal derivatives of the hidden states, i.e., $(x, x', x'', \ldots)$; this refines the estimated trajectories with an increasing sampling rate.
In addition to this, the overall deep hybrid model presents two hierarchical depths with different roles. First is a hybrid scale that separates the slow-varying representation of the discretized task from the fast update of continuous signals. Here, the temporal predictions from the continuous dynamics are used to accurately infer the discrete variables, so that the agent can make complex high-level plans even when the surrounding environment is changing frequently, and is able to revise those plans when new evidence has been accumulated. Second is a continuous scale linking the hybrid units and inherent to the hierarchical representation of the agent’s kinematic structure, which can be appreciated from Figure 2. This induces a separation of temporal scales between high and low levels of the hierarchy (e.g., the trunk vs. the hands), as the prediction errors generated from the dynamics of the hand have less and less impact as they flow back to the shoulder and trunk dynamics.
Considering the pick-and-place operation shown in Figure 3b, we can encode the three key moments of the task (start position, ball picked, and ball placed) in terms of discrete hidden states $s$. At each discrete step $\tau$, these states make (intrinsic and extrinsic) predictions for the hybrid units representing the agent’s kinematic chain (as in Figure 2). For instance, the second discrete hidden state generates a potential body configuration with the hand closed and at the ball position. We can use an identity mapping for the likelihood matrices, so that the intrinsic and extrinsic hidden causes of the hybrid units all have the same decomposition into three steps, i.e., $v_i^{(i)} = [v_{i,start}^{(i)}, v_{i,picked}^{(i)}, v_{i,placed}^{(i)}]$ and $v_e^{(i)} = [v_{e,start}^{(i)}, v_{e,picked}^{(i)}, v_{e,placed}^{(i)}]$. Notably, since these hidden causes are related to potential trajectories, the agent can pick and place the ball even in dynamic contexts, e.g., if the ball is moving.

3.2. A Deep Hybrid Model for Tool Use

In this section, we show how a deep hybrid model can be used efficiently in a task that requires planning in a dynamic environment and coordination of all elements of the agent’s body. The implementation details are found hereafter in Section 3.2.1, while Appendix A illustrates the algorithms for the inference of the discrete model and hybrid units. Then, in Section 3.3 we analyze model performance and describe the effects of dynamic planning in terms of accumulated sensory evidence and transitions over discrete hidden states.
Reaching an object with a tool is a complex task that requires all the features delineated in the previous section. First, the task has to be decomposed into subgoals—reaching the tool and reaching the object—which requires high-level discrete planning. Second, the agent has to maintain distinct beliefs about its arm, the tool, and the object, all of which must be inferred from sensory observations if their locations are unknown or constantly changing. Third, if the tool has to be grasped at the origin while the object has to be reached with the tool’s extremity, the agent’s generative model should encode a hierarchical representation of the self and every entity, and specify goals (in the form of attractors) at different levels of the hierarchy.
As shown in the graphical representation of the virtual environment of Figure 4a, the agent controls an arm of 4 DoF. The agent receives proprioceptive information about its joint angles, and exteroceptive (e.g., visual) observations encoding the positions of its limbs, the tool, and the ball. For simplicity, we assume that the tool sticks to the agent’s end effector as soon as it is touched. The generative model provides an effective decomposition into three parallel pathways, displayed in Figure 4c: one maintaining an estimate of the agent’s actual configuration (indicated by the subscript 0), and two others representing potential configurations in relation to the tool and the ball (indicated by the subscripts t and b, respectively). In other words, the objects of interest are not just encoded by their qualities or location, but already define a body configuration appropriate to achieve a specific interaction. In our case, the potential configuration related to the tool represents not only the estimated tool’s location, but also the estimated end effector’s location needed to reach the tool. In addition, each body configuration is composed of as many IE modules as the agent’s DoF, following the hierarchical relationships of the forward kinematics and allowing the expression of both joint angles and limb positions. Message passing of extrinsic prediction errors, i.e., differences between the estimated limb positions and their predictions given by the estimated joint angles, allows the inference of the whole body configuration (either actual or potential) via exteroceptive observations. In addition, every component of a level exchanges lateral messages with the other components, in the form of dynamics prediction errors. These errors are caused by the potential dynamics functions described in Section 3.1.1, which define the interactions between entities for goal-directed behavior.
Two crucial aspects arise when modeling entities in a deep hierarchical fashion and related to the self. First, different configurations are inferred (hence, different movements) depending on the desired interaction with the object considered—for example, reaching a ball with either the elbow or the end effector. Second, entities could have their own hierarchical structures; in our application, the tool consists of two Cartesian positions and an orientation, and the agent should somehow represent this additional link. For these reasons, we consider a virtual level for the tool configuration, attached to the last IE module (i.e., end effector), as exemplified in Figure 4c; the visual observations of the tool are then linked to the last two levels. From these observations, the correct tool angle can be inferred as if it were a new joint angle of the arm. Additionally, since we want the agent to touch the ball with the tool’s extremity, we model the third (ball) configuration with a similar structure, in which a visual observation of the ball is attached to the virtual level. The overall architecture can be better understood from Figure 4b, showing the agent’s continuous beliefs of all three entities. As soon as the agent perceives the tool, it infers a possible kinematic configuration as if it had visual access only to its last two joints (which are actually the tool’s origin and extremity). Likewise, perceiving the ball causes the agent to find an extended kinematic configuration as if the tool were part of the arm.
With this task formalization, specifying the correct dynamics for goal-directed behavior is simple. First, we define two sets of dynamics functions implementing every subgoal, one for reaching the tool and another for reaching the ball with the tool’s extremity. As explained in Section 3.2.1, the second subgoal requires specifying an attractor at the virtual level, which makes the agent think that the tool’s extremity will be pulled toward the ball. The biased state generates an extrinsic prediction error that is backpropagated to the previous level encoding the tool’s origin and the end effector. Notably, defining discrete hidden states and discrete hidden causes related to the agent’s intentions allows the agent to accumulate evidence over trajectories from different modalities (e.g., intrinsic or extrinsic) and hierarchical locations (e.g., elbow or end effector), ultimately solving the task via inference. The four main processes of the task, i.e., perception, dynamic inference, dynamic planning, and action, are summarized in Figure 5.

3.2.1. Implementation Details

The agent’s sensory modalities are (i) a proprioceptive observation $y_p$ for the arm’s joint angles; (ii) a visual observation $y_v$ encoding the Cartesian positions of every link of the arm, both extremities of the tool, and the ball; (iii) a discrete tactile observation $o_t$ signaling whether or not the target is grasped.
We decompose intrinsic and extrinsic hidden states of every IE module into three components, the first one corresponding to the actual arm configuration, and the other two related to potential configurations for the tool and the ball. Hence:
$$x_i^{(i)} = \begin{bmatrix} x_{i,0}^{(i)} \\ x_{i,t}^{(i)} \\ x_{i,b}^{(i)} \end{bmatrix} \qquad x_e^{(i)} = \begin{bmatrix} x_{e,0}^{(i)} \\ x_{e,t}^{(i)} \\ x_{e,b}^{(i)} \end{bmatrix}$$
The end effector’s level and the virtual level (see Figure 4c) are indicated with the superscripts $(4)$ and $(5)$, respectively. Regarding the IE module of the virtual level, the intrinsic and extrinsic hidden states only have two components related to the potential states of the tool and the ball, i.e., $x_i^{(5)} = [x_{i,t}^{(5)}, x_{i,b}^{(5)}]$ and $x_e^{(5)} = [x_{e,t}^{(5)}, x_{e,b}^{(5)}]$.
For each entity, the intrinsic hidden states encode pairs of joint angles and limb lengths, e.g., $x_{i,0}^{(i)} = [\theta_0^{(i)}, l_0^{(i)}]$, while the extrinsic reference frame is expressed in terms of the position of a limb’s extremity and its absolute orientation, e.g., $x_{e,0}^{(i)} = [p_{0,x}^{(i)}, p_{0,y}^{(i)}, \phi_0^{(i)}]$. The likelihood function $g_e$ of Equation (38) computes extrinsic predictions independently for each entity:
$$g_e(x_i^{(i)}, x_e^{(i-1)}) = \begin{bmatrix} T(x_{i,0}^{(i)}, x_{e,0}^{(i-1)}) \\ T(x_{i,t}^{(i)}, x_{e,t}^{(i-1)}) \\ T(x_{i,b}^{(i)}, x_{e,b}^{(i-1)}) \end{bmatrix}$$
Here, the mapping T ( x i , x e ) reduces to a simple roto-translation:
$$T(x_i, x_e) = \begin{bmatrix} p_x + l\, c_{\theta,\phi} \\ p_y + l\, s_{\theta,\phi} \\ \phi + \theta \end{bmatrix}$$
where $x_i = [\theta, l]$, $x_e = [p_x, p_y, \phi]$, and we used a compact notation to indicate the cosine and sine of the sum of two angles, i.e., $c_{\theta,\phi} = \cos(\theta)\cos(\phi) - \sin(\theta)\sin(\phi)$. Each level then computes proprioceptive and visual predictions through likelihood functions $g_p$ and $g_v$, which in this case are simple mappings that extract the joint angles of the actual arm configuration and the Cartesian positions of the limbs and objects from the intrinsic and extrinsic hidden states, respectively:
$$g_p(x_i^{(i)}) = \theta_0^{(i)} \qquad g_v(x_e^{(i)}) = \begin{bmatrix} p_{0,x}^{(i)} & p_{t,x}^{(i)} & p_{b,x}^{(i)} \\ p_{0,y}^{(i)} & p_{t,y}^{(i)} & p_{b,y}^{(i)} \end{bmatrix}$$
Reaching the tool’s origin with the end effector is achieved by a function (related to an agent’s intention) that sets the first component of the corresponding extrinsic hidden states equal to the second one:
$$i_{e,t}^{(4)}(x_e^{(4)}) = \begin{bmatrix} x_{e,t}^{(4)} \\ x_{e,t}^{(4)} \\ x_{e,b}^{(4)} \end{bmatrix}$$
Then, we define a potential dynamics function by subtracting the current hidden states from this intentional state:
$$f_{e,t}^{(4)}(x_e^{(4)}) = i_{e,t}^{(4)}(x_e^{(4)}) - x_e^{(4)} = \begin{bmatrix} x_{e,t}^{(4)} - x_{e,0}^{(4)} \\ 0 \\ 0 \end{bmatrix}$$
Note the decomposition into separate attractors. A non-zero velocity for the first component expresses the agent’s desire to move the end effector, while a zero velocity for the other two components means that the agent does not intend to manipulate the objects during the first step of the task. Since a potential kinematic configuration for the tool is already at the agent’s disposal, in order to speed up the movement, similar functions can be defined at every hierarchical level, both in intrinsic and extrinsic reference frames. The second step of the task involves reaching the ball with the tool’s extremity. Hence, we define two intentional states for the end effector’s and virtual levels, setting every component equal to the (potential) component related to the ball:
$$i_{e,b}^{(4)}(x_e^{(4)}) = \begin{bmatrix} x_{e,b}^{(4)} \\ x_{e,b}^{(4)} \\ x_{e,b}^{(4)} \end{bmatrix} \qquad i_{e,b}^{(5)}(x_e^{(5)}) = \begin{bmatrix} x_{e,b}^{(5)} \\ x_{e,b}^{(5)} \end{bmatrix}$$
The attractors encoded in the second set of potential dynamics functions express the agent’s desire to modify the tool’s location:
$$f_{e,b}^{(4)}(x_e^{(4)}) = i_{e,b}^{(4)}(x_e^{(4)}) - x_e^{(4)} = \begin{bmatrix} x_{e,b}^{(4)} - x_{e,0}^{(4)} \\ x_{e,b}^{(4)} - x_{e,t}^{(4)} \\ 0 \end{bmatrix} \qquad f_{e,b}^{(5)}(x_e^{(5)}) = i_{e,b}^{(5)}(x_e^{(5)}) - x_e^{(5)} = \begin{bmatrix} x_{e,b}^{(5)} - x_{e,t}^{(5)} \\ 0 \end{bmatrix}$$
Maintaining this set of dynamics eventually drives the hand into a suitable position that makes the tool’s extremity touch the ball. A representation of the relations between such dynamics is displayed in Figure 6.
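A sketch of these intentional states and potential dynamics functions, stacking the three entity components of the end effector level into an illustrative 3 × 3 array (rows: arm, tool, ball; columns: $p_x$, $p_y$, $\phi$):

```python
import numpy as np

x_e4 = np.array([[0.0, 0.0, 0.0],      # actual arm end effector
                 [1.0, 0.5, 0.2],      # potential configuration for the tool
                 [2.0, 1.0, 0.1]])     # potential configuration for the ball

def f_t(x):
    """Reach the tool's origin: pull the arm component toward the tool one."""
    i = np.stack([x[1], x[1], x[2]])   # intentional state i_{e,t}
    return i - x                       # rows: [x_t - x_0, 0, 0]

def f_b(x):
    """Reach the ball with the tool: pull arm and tool toward the ball."""
    i = np.stack([x[2], x[2], x[2]])   # intentional state i_{e,b}
    return i - x                       # rows: [x_b - x_0, x_b - x_t, 0]

def f_s(x):
    """Stay: no desired change."""
    return np.zeros_like(x)

v = np.array([0.1, 0.8, 0.1])          # hidden causes (stay, tool, ball)
eta = v[0] * f_s(x_e4) + v[1] * f_t(x_e4) + v[2] * f_b(x_e4)  # average trajectory
```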
Now, we define the hidden causes of the last two levels (for simplicity, we only describe the extrinsic hidden states):
$$v_e^{(4)} = \begin{bmatrix} v_s^{(4)} \\ v_t^{(4)} \\ v_b^{(4)} \end{bmatrix} \qquad v_e^{(5)} = \begin{bmatrix} v_s^{(5)} \\ v_b^{(5)} \end{bmatrix}$$
where the subscripts s, t, and b indicate the agent’s intentions to maintain the current state of the world (“stay”), reach the tool, and reach the ball, respectively. The first hidden cause, related to the following intentional states:
$$i_{e,s}^{(4)}(x_e^{(4)}) = x_e^{(4)} \qquad i_{e,s}^{(5)}(x_e^{(5)}) = x_e^{(5)}$$
is needed to ensure that $v_e^{(4)}$ and $v_e^{(5)}$ encode proper probabilities when the discrete model is in the initial state. The average trajectory is then found by weighting the potential dynamics functions with the corresponding hidden causes. Hence, having indicated with $\mu_e^{(4)}$ the belief over the extrinsic hidden states of the end effector, the generated trajectory is:
$$\eta_{x,e}^{(4)} = v_s^{(4)} f_{e,s}^{(4)}(\mu_e^{(4)}) + v_t^{(4)} f_{e,t}^{(4)}(\mu_e^{(4)}) + v_b^{(4)} f_{e,b}^{(4)}(\mu_e^{(4)})$$
The belief is then updated according to the following update rules:
$$\begin{aligned} \dot{\mu}_e^{(4)} &= \mu_e'^{(4)} - \pi_e^{(4)} \varepsilon_e^{(4)} + \partial g_e^T \pi_e^{(5)} \varepsilon_e^{(5)} + \partial g_v^T \pi_v^{(4)} \varepsilon_v^{(4)} + \partial \eta_{x,e}^{(4)T} \pi_{x,e}^{(4)} \varepsilon_{x,e}^{(4)} \\ \dot{\mu}_e'^{(4)} &= -\pi_{x,e}^{(4)} \varepsilon_{x,e}^{(4)} \end{aligned}$$
In short, the 0th order is subject to (i) a quantity proportional to the estimated trajectory; (ii) an extrinsic prediction error coming from the elbow, i.e., $\varepsilon_e^{(4)} = \mu_e^{(4)} - g_e(\mu_i^{(4)}, \mu_e^{(3)})$; (iii) a backward extrinsic prediction error coming from the virtual level, i.e., $\varepsilon_e^{(5)} = \mu_e^{(5)} - g_e(\mu_i^{(5)}, \mu_e^{(4)})$; (iv) a visual prediction error, i.e., $\varepsilon_v^{(4)} = y_v^{(4)} - \mu_e^{(4)}$; (v) a backward dynamics error encoding the generated trajectory, i.e., $\varepsilon_{x,e}^{(4)} = \mu_e'^{(4)} - \eta_{x,e}^{(4)}$. For a more detailed treatment of inference and dynamics of kinematic configurations in hierarchical settings, see [36].
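A compact sketch of this update, collapsing the likelihood gradients to identities so that each backward error enters additively; the precision-weighted message from the virtual level is assumed to be computed upstream, and all values are illustrative:

```python
def update_end_effector(mu, mu1, pred_from_elbow, back_msg_virtual, y_v, eta1,
                        pi_e=1.0, pi_v=1.0, pi_x=1.0, dt=0.01):
    """One integration step for the end effector's extrinsic beliefs."""
    eps_e = mu - pred_from_elbow      # (ii) extrinsic error from the level above
    eps_v = y_v - mu                  # (iv) visual prediction error
    eps_x = mu1 - eta1                # (v) backward dynamics error
    dmu = mu1 - pi_e * eps_e + back_msg_virtual + pi_v * eps_v + pi_x * eps_x
    dmu1 = -pi_x * eps_x              # 1st order descends the dynamics error
    return mu + dt * dmu, mu1 + dt * dmu1
```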
The actions a are instead computed by minimizing proprioceptive prediction errors:
$$\dot{a} = -\partial_a g_p^T \pi_p \varepsilon_p$$
where $\partial_a g_p$ performs an inverse dynamics mapping from proprioceptive predictions to actions.
As concerns the discrete model, its hidden states $s$ express (i) whether the agent is at the tool position, at the ball position, or neither of the two; (ii) whether the agent has grasped the tool or not. These two factors combine into six process states in total. The first factor generates predictions for the extrinsic hidden causes of the hybrid units through likelihood matrices, i.e., $A_e^{(4)} s$ and $A_e^{(5)} s$. This allows the agent to synchronize the behavior of both the tool and end effector; additional likelihood matrices can be defined to impose priors for the intrinsic hidden states and at different levels of the hierarchy. The second factor returns a discrete tactile prediction, i.e., $A_t s$.
Finally, we define a discrete action for each step of the task, and a transition matrix $B$ such that the ball can be reached only when the tool has been grasped. Discrete actions are replanned every 10 continuous time steps, and transitions between discrete states occur dynamically depending on continuous evidence. In the example of Figure 7, the transition between reaching the tool and reaching the ball happens after 350 time steps. The extrinsic hidden causes $v_e^{(4)}$ and $v_e^{(5)}$ are found by Bayesian model comparison, as explained in Section 3.1.1:
$$v_e^{(4)} = \sigma(\ln A_e^{(4)} s + l_e^{(4)}) \qquad v_e^{(5)} = \sigma(\ln A_e^{(5)} s + l_e^{(5)})$$
As noted above, $A_e^{(4)} s$ and $A_e^{(5)} s$ represent predictions of extrinsic hypotheses made by the discrete model for the end effector and virtual levels, e.g., a higher value of $v_b^{(4)}$ and $v_b^{(5)}$ means that the discrete model wants to reach the ball. Conversely, $l_e^{(4)}$ and $l_e^{(5)}$ are the bottom-up messages that accumulate continuous log evidence over some time $T$; that is, they provide information about whether the end effector is reaching the tool or the ball based on the context. Comparison between such high-level expectations and low-level evidence permits inferring the discrete hidden states based on potential trajectories. In fact, the hidden causes act as additional observations for the discrete model, which infers the states at time $\tau$ by combining them with the tactile observation and the hidden states at time $\tau - 1$:
$$s_{\pi,\tau} = \sigma\left(\ln B_{\pi,\tau-1} s_{\tau-1} + \ln A_e^{(4)T} v_{e,\tau}^{(4)} + \ln A_e^{(5)T} v_{e,\tau}^{(5)} + \ln A_t^T o_t\right)$$
If we assume for simplicity that the agent’s preferences are encoded in a tensor $C$ in terms of expected states, the expected free energy breaks down to:
$$G_\pi \approx \sum_\tau s_{\pi,\tau} \cdot \left[\ln s_{\pi,\tau} - \ln p(s_\tau|C)\right]$$
Computing the softmax of the expected free energy returns the posterior probability over the policies π , which are used to infer the new discrete hidden states at time τ + 1 .
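A sketch of this simplified EFE with preferences defined over states, comparing two candidate policies (all numbers illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

C = np.array([0.05, 0.05, 0.9])                 # preferred (terminal) states
s_pi = {                                        # predicted states per policy
    "reach_tool": np.array([0.1, 0.8, 0.1]),
    "reach_ball": np.array([0.1, 0.1, 0.8]),
}
G = {pi: s @ (np.log(s + 1e-16) - np.log(C + 1e-16)) for pi, s in s_pi.items()}
posterior = softmax(-np.array(list(G.values())))  # sigma(-G) over policies
```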

3.3. Analysis of Model Performances

Figure 7 illustrates task progress during a sample trial. Although both objects are moving, the discrete model successfully infers and imposes continuous trajectories allowing the agent to operate correctly and achieve its goal. At the beginning of a trial, the beliefs of the hidden states are initialized with the actual starting configuration of the arm. The agent infers two potential kinematic configurations for the tool and the ball. While the two observations of the tool constrain the corresponding inference, the ball belief is only subject to its actual position, thus letting the agent initially overestimate the length of the virtual level. During the first phase, only the tool reaching intention is active; as a consequence, the tool belief constantly biases the arm belief, which in turn pulls the real arm. After 350 steps, both these beliefs are in the same configuration, while the ball belief has inferred the corresponding position. At this point, the tool is grasped, causing the discrete model to predict a different combination of hidden causes. Now, both the tool and arm beliefs are pulled toward the ball belief. After about 800 steps, the agent infers the same configuration for all three beliefs, successfully reaching the ball with the tool’s extremity, and tracking it until the trial ends. Note that even during the first reaching movement, the agent continuously updates its configuration in relation to the ball; as a result, the second reaching movement is faster.
The transitions can be better appreciated in Figure 8a, which shows the bottom-up messages (i.e., the accumulated evidence) l_e^{(4)} and l_e^{(5)} for the last two levels of the hierarchy (i.e., the end effector's and virtual levels), and the discrete hidden states s_τ. As is evident, the virtual level does not contribute to the inference of the first reaching trajectory, since this only involves the end effector. Note how the agent dynamically accumulates evidence over its discrete hypotheses: during the first phase, the evidence l_{e,t}^{(4)} (related to the tool belief at the end effector's level) increases as soon as the end effector approaches the tool's origin, while l_{e,b}^{(4)} and l_{e,b}^{(5)} (related to the ball belief at the end effector's and virtual levels, respectively) decrease as the ball moves away. During the second phase, the latter two rapidly increase as the end effector approaches the ball; finally, every probability at both levels slowly stabilizes as the extrinsic beliefs converge to the same value and the errors are minimized. The slow decrease of the initial state and the fast transition between the two steps are well summarized in the bottom graph. The trajectories of the hidden states show that the agent can replan with a high frequency (in this case, every 10 continuous time steps), allowing it to react rapidly to environmental stimuli.
The relationship between hidden causes and continuous dynamics is summarized in Figure 8b, showing the extrinsic potential and estimated dynamics for the end effector's and virtual levels, as well as the dynamics of the extrinsic hidden causes. For simplicity, we considered the same discrete hidden causes for every hierarchical level, and the top plot recapitulates the state of the whole kinematic configuration. Here, we note a behavior similar to that of the discrete hidden states (i.e., a slow decrease of the stay cause and an increase of the first reaching movement, followed by a rapid increase of the second reaching movement in the middle of the trial). Note that the agent maintains the potential dynamics related to the three intentions for the whole duration of the trial; these dynamics are combined to produce a trajectory that the motor units realize. In fact, two spikes are evident in the dynamics of the end effector's level, and one in the dynamics of the virtual level, corresponding to the ball-reaching action.
In order to assess the model's performance in dynamic planning tasks, we ran three different experiments, each composed of 150 trials. The first experiment assessed the capacity to pick a moving tool and reach a static target; hence, for each trial, we varied the tool velocity. The second experiment assessed the capacity to pick a static tool and track a moving target; here, we varied the ball velocity. The third experiment evaluated the agent's performance in picking a moving tool and tracking a moving target; hence, we varied both tool and ball velocities. In all experiments, we randomly sampled tool and ball positions, and their directions where relevant. Velocities ranged from 0 to 8 pixels per time step. The virtual environment measured 1300 × 1300 pixels, i.e., twice the total arm length plus the tool length; thus, the ball and tool were out of reach for a significant portion of each trial. The duration of each trial was set to 3000 steps.
The results of the simulations are visualized in Figure 9, showing the task accuracy, the time needed to complete the task, and the average final error (see the caption for more details). With static or slowly varying environments, the agent completes the task in all conditions. The first condition (moving tool) achieved good performance even with high tool velocities, although accuracy decreases slightly at low velocities: since the tool often moves out of reach, the agent cannot accomplish the tool-picking action. This behavior is not present in the second condition (moving ball), probably due to the increased operational space, which allows the agent to move across the whole environment. However, performance also decreases for high ball velocities, due to the additional difficulty of tracking moving objects. The third, combined condition (moving tool and ball) achieved slightly lower performance than the second condition, with a completion time similar to that of the first condition. However, the average tracking error shown in the bottom panel remains small even with high ball and tool velocities.
Finally, the dynamic behavior of the extrinsic beliefs can be analyzed in Figure 10, which shows, for a sample trial of the third condition, the trajectories of all the forces that make up the update of Equation (55), for the last two levels and every environmental entity. The transition between the two phases of the task is evident here; the 1st-order derivative of the arm belief μ'_{e,a}^{(4)} (blue line in the top left panel) is non-zero during the whole task and presents two spikes at the beginning of each phase, signaling an increased prediction error due to the new intention. The arm movement during the first phase is the consequence of the non-zero 1st-order derivative of the tool belief μ'_{e,t}^{(4)} (blue line in the middle left panel). The dynamics of the corresponding extrinsic prediction error ε_{e,t}^{(5)} (green line in the middle left panel) combine this derivative with the visual prediction error of the next level ε_{v,t}^{(5)} (red line in the middle right panel). Note that this extrinsic prediction error does not exist for the arm belief, and that the backward dynamics error ε_{e,x,a}^{(4)} has a smaller impact on the overall update than the 1st-order derivative. The second phase begins with a spike in the 1st-order derivative of the tool belief at the virtual level μ'_{e,t}^{(5)} (blue line in the middle right panel), which is propagated back to the previous level as an extrinsic prediction error. Finally, note that the ball belief is only subject to its visual observation and the extrinsic prediction error coming from the previous levels.

4. Discussion

We proposed a computational method, based on hybrid active inference, that affords dynamic planning in hierarchical settings. Our goal was twofold. First, to show the effectiveness of casting control problems as inference and, in particular, of expressing entities in relation to a hierarchical configuration of the self. While there could be several ways to combine the units of the proposed architecture, we showed a specific design as a proof of concept to solve a typical task: reaching a moving object with a tool. The agent had to rely on three kinds of depth, i.e., it had to dynamically infer its intentions for decision making, and form different hierarchical generative models depending on the structure and affordances of the entities. The proposed model unifies several characteristics studied in the active inference literature: the modeling of objects, recently performed in the context of active object reconstruction [69,70,71]; the analysis of affordances in relation to the agent's beliefs [72]; the modeling of itinerant movements, with a behavior similar to the Lotka–Volterra dynamics implemented in [73]; planning and control in extrinsic coordinates [6,74]; the inference of discrete states based on continuous signals in dynamic environments, achieved through a different kind of post hoc Bayesian model selection [75] or various other approaches such as bio-inspired SLAM [33], dynamic Bayesian networks [76], and recurrent switching linear dynamical systems [77]; and the use of tools for solving complex tasks [78].
Our second goal was to show that a (deep) hierarchical formulation of active inference could lend itself to the learning and generalization of novel tasks. Although we used a fixed generative model, we showed that advanced behavior is possible by using likelihood and dynamics functions that could be easily implemented with neural connections, and by decomposing the model into small units linked together. In [79], a hierarchical kinematic model was used to learn the limbs of an agent's kinematic chain, both during perception and action. The same mechanism could be used to infer the length of tools needed for object manipulation, extending the kinematic chain in a flexible way. Therefore, an encouraging research direction would be to design a deep hybrid model along the lines of PCNs, and let the agent learn the appropriate structure and internal attractors for a specific goal via free energy minimization. PCNs have demonstrated robust performance in tasks like classification and regression [46,80], while approximating the backpropagation algorithm [81,82,83,84]. However, few studies have leveraged the modular and hierarchical nature of predictive coding to model complex dynamics [47,85,86,87,88] or enable interactions with the environment [18,89,90,91,92], mostly achieved through RL. Well-known issues of deep RL are data efficiency, explainability, and generalization [45]. In contrast, the human brain is capable of learning new tasks from a small number of samples, transferring knowledge previously acquired in similar situations. Another common criticism is that deep RL lacks explainability, which is of growing concern as AI systems rapidly scale. A viable alternative is to learn a model of the environment [93], e.g., with Bayesian non-parametrics [18]; however, these approaches are still computationally demanding. Albarracin et al. described how active inference may provide an answer to the black box problem [94], and we further showed how the different elements of an active inference agent have practical and interpretable meanings. In this view, the optimization of parameters in hybrid models could be an effective alternative to deep RL algorithms, or to other approaches in active inference relying on neural networks as generative models.
Besides the fixed generative model, another limitation of the proposed study is that we only used two temporal orders, while a more complete and effective model would make use of a larger set of generalized coordinates [49]. Nonetheless, every aspect we introduced can be extended to higher temporal orders. For instance, discrete variables could depend on the position, velocity, and acceleration of an object, thus inferring a more accurate representation of dynamic trajectories. Also, flexible behavior could be specified in the 2nd temporal order, resulting in a more realistic force-controlled system.
An interesting direction of research concerns the generation of states and paths, for which useful indications might come from planning and control with POMDP models [18]. Some implementations of discrete active inference models have used additional connections between policies and between discrete hidden states [60,95]; hence, it might be beneficial to design similar connections in continuous and hybrid contexts as well. In this study, a single high-level discrete model imposed the behavior of every other hybrid unit; an alternative would be to design independent connections between hidden causes, such that a high-level decision is propagated down to lower levels with local message passing. This approach may also provide insights into how, through repetition of the same task, discrete policies adapt to construct composite movements (e.g., a reaching-and-grasping action) from simpler continuous paths.

Author Contributions

M.P. designed and performed research, contributed new analytic tools, analyzed data, and wrote the paper; I.P.S. designed research and wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the European Union's Horizon H2020-EIC-FETPROACT-2019 Programme for Research and Innovation under Grant Agreement 951910 to I.P.S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This work was carried out while M.P. was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with the Institute of Cognitive Sciences and Technologies, National Research Council of Italy.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Code and data can be found at: https://github.com/priorelli/dynamic-planning (accessed on 9 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Algorithm A1 Compute expected free energy
Input:
   length of policies N_p
   discrete hidden states s
   policies \pi \in \Pi
   transition matrix B
   preference C
Output:
   expected free energy G

G \gets 0
for each policy \pi \in \Pi do
    s_{\pi,\tau} \gets s
    for \tau = 0 to N_p do
        s_{\pi,\tau} \gets B_{\pi,\tau} s_{\pi,\tau}
        G_\pi \gets G_\pi + s_{\pi,\tau} \cdot (\ln s_{\pi,\tau} - \ln C)
    end for
end for
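A minimal Python/NumPy transcription of Algorithm A1 might look as follows; the encoding of policies as sequences of action indices into a transition tensor B is an assumption of this sketch, not a prescription of the actual implementation.

import numpy as np

def expected_free_energy(s, policies, B, C):
    """Algorithm A1: G_pi = sum_tau s_{pi,tau} . (ln s_{pi,tau} - ln C).

    s        -- current state distribution, shape (n_states,)
    policies -- array of action indices, shape (n_policies, N_p)
    B        -- transition tensor; B[a] maps states under action a
    C        -- preferred state distribution (the preference tensor C)
    """
    log_C = np.log(C + 1e-16)
    G = np.zeros(len(policies))
    for k, pi in enumerate(policies):
        s_pi = s.copy()
        for a in pi:
            s_pi = B[a] @ s_pi                             # predicted state
            G[k] += s_pi @ (np.log(s_pi + 1e-16) - log_C)  # risk over states
    return G

With preferences defined over states only, each summand is the Kullback–Leibler divergence between the predicted and preferred state distributions, so policies expected to keep the agent close to its preferences receive a low G.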
Algorithm A2 Accumulate log evidence
Input:
   mean of full prior \eta
   means of reduced priors \eta_m
   mean of full posterior \mu
   precision of full prior \Pi
   precisions of reduced priors \Pi_m
   precision of full posterior P
   log evidence L_m
Output:
   log evidence L_m

for each reduced model m do
    P_m \gets P - \Pi + \Pi_m
    \mu_m \gets P_m^{-1} (P \mu - \Pi \eta + \Pi_m \eta_m)
    L_m \gets L_m + (\mu_m^\top P_m \mu_m - \eta_m^\top \Pi_m \eta_m - \mu^\top P \mu + \eta^\top \Pi \eta) / 2
end for
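In code, one accumulation step could be sketched as follows (Python/NumPy); the minus signs follow the standard Bayesian model reduction expressions for Gaussian full and reduced priors, and all variable names are illustrative.

import numpy as np

def accumulate_log_evidence(mu, P, eta, Pi, reduced, L):
    """Algorithm A2 for Gaussian full and reduced priors.

    mu, P   -- mean and precision of the full posterior
    eta, Pi -- mean and precision of the full prior
    reduced -- list of (eta_m, Pi_m) pairs, one per reduced model
    L       -- running log evidence per reduced model (1-D array)
    """
    for m, (eta_m, Pi_m) in enumerate(reduced):
        P_m = P - Pi + Pi_m                       # reduced posterior precision
        mu_m = np.linalg.solve(P_m, P @ mu - Pi @ eta + Pi_m @ eta_m)
        L[m] += 0.5 * (mu_m @ P_m @ mu_m - eta_m @ Pi_m @ eta_m
                       - mu @ P @ mu + eta @ Pi @ eta)
    return L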
Algorithm A3 Active inference with deep hybrid models
Input:
   continuous trial duration T_c
   discrete replanning period T
   intrinsic units U_i^{(i,j)}
   extrinsic units U_e^{(i,j)}
   inverse dynamics \partial_a g_p
   proprioceptive precisions \Pi_{y,p}
   learning rate \Delta t
   action a

for t = 0 to T_c do
    Get observations
    if t mod T = 0 then
        Update discrete model via Algorithm A4
    end if
    for each unit U_i^{(i,j)} and U_e^{(i,j)} do
        Update intrinsic unit via Algorithm A5
        Update extrinsic unit via Algorithm A6
    end for
    Get proprioceptive prediction errors \varepsilon_{y,p} from intrinsic units
    \dot{a} \gets -\partial_a g_p^\top \Pi_{y,p} \varepsilon_{y,p}
    a \gets a + \Delta t \, \dot{a}
    Take action a
end for
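The overall perception-action cycle can be summarized by the following Python skeleton; the environment and unit interfaces are hypothetical placeholders standing in for Algorithms A4-A6, so this is a structural sketch under stated assumptions rather than the actual implementation.

import numpy as np

def run_trial(env, intrinsic_units, extrinsic_units, discrete_model,
              T_c=3000, T=10, dt=0.01):
    a = np.zeros(env.n_joints)                      # continuous action
    for t in range(T_c):
        obs = env.observe()                         # proprioceptive + visual input
        if t % T == 0:
            discrete_model.update(intrinsic_units + extrinsic_units)  # Alg. A4
        for u in intrinsic_units + extrinsic_units:
            u.update(obs, dt)                       # Algorithms A5 and A6
        # act by descending the proprioceptive prediction error
        eps_p = np.concatenate([u.eps_p for u in intrinsic_units])
        a_dot = -env.dgp_da().T @ (env.Pi_p @ eps_p)
        a += dt * a_dot
        env.step(a)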
Algorithm A4 Update discrete model at time \tau
Input:
   discrete hidden states s
   policies \pi
   likelihood matrices A^{(i,j)}
   transition matrix B
   prior D
   accumulated log evidence l^{(i,j)}
Output:
   accumulated log evidence l^{(i,j)}
   discrete hidden causes v^{(i,j)}

for each unit U^{(i,j)} do
    v^{(i,j)} \gets \sigma(\ln A^{(i,j)} s + l^{(i,j)})
end for
s \gets \sigma(\ln D + \sum_{i,j} \ln A^{(i,j)\top} v^{(i,j)})
Compute expected free energy G via Algorithm A1
\pi \gets \sigma(-G)
D \gets \sum_\pi \pi_\pi B_{\pi,0} s
for each unit U^{(i,j)} do
    v^{(i,j)} \gets A^{(i,j)} D
    l^{(i,j)} \gets 0
end for
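A compact Python rendering of this replanning step, reusing the expected_free_energy helper from the Algorithm A1 sketch above, could read as follows; the unit objects holding A, l, and v are an illustrative interface, not the actual code.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_discrete_model(units, s, D, policies, B, C):
    # bottom-up: bias top-down predictions with the accumulated evidence
    for u in units:
        u.v = softmax(np.log(u.A @ s + 1e-16) + u.l)
    # infer states, treating the hidden causes as additional observations
    s = softmax(np.log(D + 1e-16)
                + sum(np.log(u.A + 1e-16).T @ u.v for u in units))
    # plan: posterior over policies from the negative expected free energy
    G = expected_free_energy(s, policies, B, C)   # Algorithm A1 sketch
    pi = softmax(-G)
    # Bayesian model average of the next state becomes the new prior
    D = sum(p * (B[pol[0]] @ s) for p, pol in zip(pi, policies))
    # top-down: fresh discrete predictions, and reset the evidence
    for u in units:
        u.v = u.A @ D
        u.l = np.zeros_like(u.l)
    return s, D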
Algorithm A5 Update intrinsic unit
Input:
   belief over extrinsic hidden states of previous level \mu_e^{(i-1)}
   belief over intrinsic hidden states \tilde{\mu}_i^{(i)}
   intrinsic (discrete) hidden causes v_i^{(i)}
   proprioceptive observation y_p^{(i)}
   belief over extrinsic hidden states \tilde{\mu}_e^{(i)}
   intrinsic (reduced) dynamics functions f_{i,m}^{(i)}
   proprioceptive likelihood g_p
   extrinsic likelihood g_e
   proprioceptive precision \Pi_{y,p}^{(i)}
   extrinsic precision \Pi_{y,e}^{(i)}
   intrinsic dynamics precision \Pi_{x,i}^{(i)}
   learning rate \Delta t
Output:
   proprioceptive prediction error \varepsilon_{y,p}^{(i)}
   extrinsic prediction error \varepsilon_{y,e}^{(i)}

\eta_{x,i}^{(i)} \gets \sum_m v_{i,m}^{(i)} f_{i,m}^{(i)}(\mu_i^{(i)})
\varepsilon_{y,p}^{(i)} \gets y_p^{(i)} - g_p(\mu_i^{(i)})
\varepsilon_{x,i}^{(i)} \gets \mu_i'^{(i)} - \eta_{x,i}^{(i)}
\varepsilon_{y,e}^{(i)} \gets \mu_e^{(i)} - g_e(\mu_i^{(i)}, \mu_e^{(i-1)})
Accumulate log evidence via Algorithm A2
\dot{\mu}_i^{(i)} \gets \mu_i'^{(i)} + \partial_i g_p^\top \Pi_{y,p}^{(i)} \varepsilon_{y,p}^{(i)} + \partial_i g_e^\top \Pi_{y,e}^{(i)} \varepsilon_{y,e}^{(i)} + \partial_i \eta_{x,i}^{(i)\top} \Pi_{x,i}^{(i)} \varepsilon_{x,i}^{(i)}
\dot{\mu}_i'^{(i)} \gets -\Pi_{x,i}^{(i)} \varepsilon_{x,i}^{(i)}
\tilde{\mu}_i^{(i)} \gets \tilde{\mu}_i^{(i)} + \Delta t \, \dot{\tilde{\mu}}_i^{(i)}
Algorithm A6 Update extrinsic unit
Input:
   extrinsic prediction error \varepsilon_{y,e}^{(i)}
   belief over extrinsic hidden states \tilde{\mu}_e^{(i)}
   extrinsic (discrete) hidden causes v_e^{(i)}
   visual observation y_v^{(i)}
   extrinsic prediction errors of next levels \varepsilon_{y,e}^{(i+1,l)}
   extrinsic (reduced) dynamics functions f_{e,m}^{(i)}
   visual likelihood g_v
   extrinsic precision \Pi_{y,e}^{(i)}
   extrinsic precisions of next levels \Pi_{y,e}^{(i+1,l)}
   visual precision \Pi_{y,v}^{(i)}
   extrinsic dynamics precision \Pi_{x,e}^{(i)}
   learning rate \Delta t
Output:
   belief over extrinsic hidden states \tilde{\mu}_e^{(i)}

\eta_{x,e}^{(i)} \gets \sum_m v_{e,m}^{(i)} f_{e,m}^{(i)}(\mu_e^{(i)})
\varepsilon_{y,v}^{(i)} \gets y_v^{(i)} - g_v(\mu_e^{(i)})
\varepsilon_{x,e}^{(i)} \gets \mu_e'^{(i)} - \eta_{x,e}^{(i)}
Accumulate log evidence via Algorithm A2
\dot{\mu}_e^{(i)} \gets \mu_e'^{(i)} - \Pi_{y,e}^{(i)} \varepsilon_{y,e}^{(i)} + \sum_l \partial_e g_e^\top \Pi_{y,e}^{(i+1,l)} \varepsilon_{y,e}^{(i+1,l)} + \partial_e g_v^\top \Pi_{y,v}^{(i)} \varepsilon_{y,v}^{(i)} + \partial_e \eta_{x,e}^{(i)\top} \Pi_{x,e}^{(i)} \varepsilon_{x,e}^{(i)}
\dot{\mu}_e'^{(i)} \gets -\Pi_{x,e}^{(i)} \varepsilon_{x,e}^{(i)}
\tilde{\mu}_e^{(i)} \gets \tilde{\mu}_e^{(i)} + \Delta t \, \dot{\tilde{\mu}}_e^{(i)}
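For concreteness, the extrinsic update can be sketched for a single unit as follows (Python/NumPy), passing the likelihood Jacobians explicitly; all names and interfaces are illustrative assumptions under a linear visual likelihood, not the actual implementation.

import numpy as np

def update_extrinsic_unit(mu, mu1, v, f_list, J_list,
                          y_v, G_v, Pi_v,
                          eps_e_prev, Pi_e,
                          next_terms, Pi_x, dt):
    """mu, mu1        -- 0th- and 1st-order beliefs of the unit
    v, f_list, J_list -- hidden causes, potential dynamics, their Jacobians
    y_v, G_v          -- visual observation and (linear) visual likelihood
    eps_e_prev        -- extrinsic prediction error from the previous level
    next_terms        -- list of (J, Pi, eps) from the levels above
    """
    eta = sum(vm * f(mu) for vm, f in zip(v, f_list))   # averaged trajectory
    J_eta = sum(vm * J for vm, J in zip(v, J_list))     # its Jacobian
    eps_v = y_v - G_v @ mu                              # visual error
    eps_x = mu1 - eta                                   # dynamics error
    d_mu = (mu1 - Pi_e @ eps_e_prev
            + sum(J.T @ (P @ e) for J, P, e in next_terms)
            + G_v.T @ (Pi_v @ eps_v)
            + J_eta.T @ (Pi_x @ eps_x))
    d_mu1 = -Pi_x @ eps_x                               # 1st-order update
    return mu + dt * d_mu, mu1 + dt * d_mu1, eps_v, eps_x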

References

  1. Todorov, E. Optimality principles in sensorimotor control. Nat. Neurosci. 2004, 7, 907–915. [Google Scholar] [CrossRef] [PubMed]
  2. Diedrichsen, J.; Shadmehr, R.; Ivry, R.B. The coordination of movement: Optimal feedback control and beyond. Trends Cogn. Sci. 2010, 14, 31–39. [Google Scholar] [CrossRef] [PubMed]
  3. Friston, K.J.; Shiner, T.; FitzGerald, T.; Galea, J.M.; Adams, R.; Brown, H.; Dolan, R.J.; Moran, R.; Stephan, K.E.; Bestmann, S. Dopamine, Affordance and Active Inference. PLoS Comput. Biol. 2012, 8, e1002327. [Google Scholar] [CrossRef]
  4. Friston, K.J.; Daunizeau, J.; Kiebel, S.J. Reinforcement learning or active inference? PLoS ONE 2009, 4, e6421. [Google Scholar] [CrossRef]
  5. Friston, K. What is optimal about motor control? Neuron 2011, 72, 488–498. [Google Scholar] [CrossRef]
  6. Friston, K.J.; Daunizeau, J.; Kilner, J.; Kiebel, S.J. Action and behavior: A free-energy formulation. Biol. Cybern. 2010, 102, 227–260. [Google Scholar] [CrossRef]
  7. Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138. [Google Scholar] [CrossRef] [PubMed]
  8. Parr, T.; Pezzulo, G.; Friston, K.J. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; The MIT Press: Cambridge, MA, USA, 2022. [Google Scholar] [CrossRef]
  9. Priorelli, M.; Maggiore, F.; Maselli, A.; Donnarumma, F.; Maisto, D.; Mannella, F.; Stoianov, I.P.; Pezzulo, G. Modeling motor control in continuous-time Active Inference: A survey. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 485–500. [Google Scholar] [CrossRef]
  10. Parr, T.; Friston, K.J. Uncertainty, epistemics and active inference. J. R. Soc. Interface 2017, 14, 20170376. [Google Scholar] [CrossRef]
  11. Rao, R.P.; Ballard, D.H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 1999, 2, 79–87. [Google Scholar] [CrossRef]
  12. Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 2013, 36, 181–204. [Google Scholar] [CrossRef] [PubMed]
  13. Hohwy, J. The Predictive Mind; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
  14. Millidge, B.; Tschantz, A.; Seth, A.K.; Buckley, C.L. On the relationship between active inference and control as inference. In Active Inference; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2020; Volume 1326, pp. 3–11. [Google Scholar]
  15. Botvinick, M.; Toussaint, M. Planning as inference. Trends Cogn. Sci. 2012, 16, 485–488. [Google Scholar] [CrossRef] [PubMed]
  16. Toussaint, M.; Storkey, A. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ACM Int. Conf. Proceeding Ser. 2006, 148, 945–952. [Google Scholar] [CrossRef]
  17. Toussaint, M. Probabilistic inference as a model of planned behavior. Künstliche Intell. 2009, 23, 23–29. [Google Scholar]
  18. Stoianov, I.; Pennartz, C.; Lansink, C.; Pezzulo, G. Model-based spatial navigation in the hippocampus-ventral striatum circuit: A computational analysis. PLoS Comput. Biol. 2018, 14, e1006316. [Google Scholar] [CrossRef]
  19. Friston, K. Hierarchical models in the brain. PLoS Comput. Biol. 2008, 4, e1000211. [Google Scholar] [CrossRef]
  20. Friston, K.J.; Parr, T.; Yufik, Y.; Sajid, N.; Price, C.J.; Holmes, E. Generative models, linguistic communication and active inference. Neurosci. Biobehav. Rev. 2020, 118, 42–64. [Google Scholar] [CrossRef]
  21. Kandel, E.R.; Schwartz, J.H.; Jessell, T.M.; Siegelbaum, S.A.; Hudspeth, A.J. Principles of Neural Science, 5th ed.; McGraw-Hill: New York, NY, USA, 2013. [Google Scholar]
  22. Cardinali, L.; Frassinetti, F.; Brozzoli, C.; Urquizar, C.; Roy, A.C.; Farnè, A. Tool-use induces morphological updating of the body schema. Curr. Biol. 2009, 19, 478. [Google Scholar] [CrossRef]
  23. Maravita, A.; Iriki, A. Tools for the body (schema). Trends Cogn. Sci. 2004, 8, 79–86. [Google Scholar] [CrossRef]
  24. Baldauf, D.; Cui, H.; Andersen, R.A. The posterior parietal cortex encodes in parallel both goals for double-reach sequences. J. Neurosci. 2008, 28, 10081–10089. [Google Scholar] [CrossRef]
  25. Cisek, P.; Kalaska, J.F. Neural Correlates of Reaching Decisions in Dorsal Premotor Cortex: Specification of Multiple Direction Choices and Final Selection of Action. Neuron 2005, 45, 801–814. [Google Scholar] [CrossRef] [PubMed]
  26. Ueltzhöffer, K. Deep Active Inference. arXiv 2017, arXiv:1709.02341. [Google Scholar] [CrossRef] [PubMed]
  27. Millidge, B. Deep active inference as variational policy gradients. J. Math. Psychol. 2020, 96, 102348. [Google Scholar] [CrossRef]
  28. Fountas, Z.; Sajid, N.; Mediano, P.A.; Friston, K. Deep active inference agents using Monte-Carlo methods. In Proceedings of the Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Online, 6–12 December 2020. [Google Scholar]
  29. Rood, T.; van Gerven, M.; Lanillos, P. A deep active inference model of the rubber-hand illusion. arXiv 2020, arXiv:2008.07408. [Google Scholar]
  30. Sancaktar, C.; van Gerven, M.A.J.; Lanillos, P. End-to-End Pixel-Based Deep Active Inference for Body Perception and Action. In Proceedings of the 2020 Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Valparaiso, Chile, 26–30 October 2020; pp. 1–8. [Google Scholar] [CrossRef]
  31. Champion, T.; Grześ, M.; Bonheme, L.; Bowman, H. Deconstructing deep active inference. arXiv 2023, arXiv:2303.01618. [Google Scholar]
  32. Zelenov, A.; Krylov, V. Deep active inference in control tasks. In Proceedings of the 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Kuala Lumpur, Malaysia, 12–13 June 2021; pp. 1–3. [Google Scholar] [CrossRef]
  33. Çatal, O.; Verbelen, T.; Van de Maele, T.; Dhoedt, B.; Safron, A. Robot navigation as hierarchical active inference. Neural Netw. 2021, 142, 192–204. [Google Scholar] [CrossRef]
  34. Yuan, K.; Friston, K.; Li, Z.; Sajid, N. Hierarchical generative modelling for autonomous robots. Res. Sq. 2023, 5, 1402–1414. [Google Scholar] [CrossRef]
  35. Priorelli, M.; Stoianov, I.P. Flexible Intentions: An Active Inference Theory. Front. Comput. Neurosci. 2023, 17, 1128694. [Google Scholar] [CrossRef]
  36. Priorelli, M.; Pezzulo, G.; Stoianov, I.P. Deep kinematic inference affords efficient and scalable control of bodily movements. Proc. Natl. Acad. Sci. USA 2023, 120, e2309058120. [Google Scholar] [CrossRef]
  37. Friston, K.J.; Parr, T.; de Vries, B. The graphical brain: Belief propagation and active inference. Netw. Neurosci. 2017, 1, 381–414. [Google Scholar] [CrossRef]
  38. Friston, K.J.; Rosch, R.; Parr, T.; Price, C.; Bowman, H. Deep temporal models and active inference. Neurosci. Biobehav. Rev. 2017, 77, 388–402. [Google Scholar] [CrossRef] [PubMed]
  39. Parr, T.; Friston, K.J. Active inference and the anatomy of oculomotion. Neuropsychologia 2018, 111, 334–343. [Google Scholar] [CrossRef] [PubMed]
  40. Parr, T.; Friston, K.J. The computational pharmacology of oculomotion. Psychopharmacology 2019, 236, 2473–2484. [Google Scholar] [CrossRef] [PubMed]
  41. Hohwy, J. New directions in predictive processing. Mind Lang. 2020, 35, 209–223. [Google Scholar] [CrossRef]
  42. Friston, K.; Kiebel, S. Predictive coding under the free-energy principle. Philos. Trans. R. Soc. B Biol. Sci. 2009, 364, 1211–1221. [Google Scholar] [CrossRef]
  43. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999, 37, 183–233. [Google Scholar] [CrossRef]
  44. Kobayashi, H.; Bahl, L.R. Image Data Compression by Predictive Coding I: Prediction Algorithms. IBM J. Res. Dev. 1974, 18, 164–171. [Google Scholar] [CrossRef]
  45. Millidge, B.; Salvatori, T.; Song, Y.; Bogacz, R.; Lukasiewicz, T. Predictive Coding: Towards a Future of Deep Learning beyond Backpropagation? In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 5538–5545. [Google Scholar]
  46. Salvatori, T.; Mali, A.; Buckley, C.L.; Lukasiewicz, T.; Rao, R.P.N.; Friston, K.; Ororbia, A. Brain-Inspired Computational Intelligence via Predictive Coding. arXiv 2023, arXiv:2308.07870. [Google Scholar]
  47. Millidge, B.; Tang, M.; Osanlouy, M.; Harper, N.S.; Bogacz, R. Predictive coding networks for temporal prediction. PLoS Comput. Biol. 2024, 20, e1011183. [Google Scholar] [CrossRef]
  48. Parr, T.; Friston, K.J. The Discrete and Continuous Brain: From Decisions to Movement—And Back Again. Neural Comput. 2018, 30, 2319–2347. [Google Scholar] [CrossRef]
  49. Friston, K.; Stephan, K.; Li, B.; Daunizeau, J. Generalised Filtering. Math. Probl. Eng. 2010, 2010, 621670. [Google Scholar] [CrossRef]
  50. Friston, K.; Da Costa, L.; Sajid, N.; Heins, C.; Ueltzhöffer, K.; Pavliotis, G.A.; Parr, T. The free energy principle made simpler but not too simple. Phys. Rep. 2022, 1024, 1–29. [Google Scholar] [CrossRef]
  51. Adams, R.A.; Shipp, S.; Friston, K.J. Predictions not commands: Active inference in the motor system. Brain Struct. Funct. 2013, 218, 611–643. [Google Scholar] [CrossRef] [PubMed]
  52. Pezzulo, G.; Rigoli, F.; Friston, K.J. Hierarchical Active Inference: A Theory of Motivated Control. Trends Cogn. Sci. 2018, 22, 294–306. [Google Scholar] [CrossRef]
  53. Friston, K.J.; Frith, C.D. Active inference, communication and hermeneutics. Cortex 2015, 68, 129–143. [Google Scholar] [CrossRef]
  54. Parr, T.; Friston, K.J. Generalised free energy and active inference. Biol. Cybern. 2019, 113, 495–513. [Google Scholar] [CrossRef]
  55. Smith, R.; Friston, K.J.; Whyte, C.J. A step-by-step tutorial on active inference and its application to empirical data. J. Math. Psychol. 2022, 107, 102632. [Google Scholar] [CrossRef]
  56. Da Costa, L.; Parr, T.; Sajid, N.; Veselic, S.; Neacsu, V.; Friston, K. Active inference on discrete state-spaces: A synthesis. J. Math. Psychol. 2020, 99, 102447. [Google Scholar] [CrossRef]
  57. Friston, K.; Parr, T.; Zeidman, P. Bayesian model reduction. arXiv 2018, arXiv:1805.07092. [Google Scholar]
  58. Friston, K.; Mattout, J.; Trujillo-Barreto, N.; Ashburner, J.; Penny, W. Variational free energy and the Laplace approximation. NeuroImage 2007, 34, 220–234. [Google Scholar] [CrossRef]
  59. Friston, K.; Penny, W. Post hoc Bayesian model selection. NeuroImage 2011, 56, 2089–2099. [Google Scholar] [CrossRef] [PubMed]
  60. Friston, K.J.; Parr, T.; Heins, C.; Constant, A.; Friedman, D.; Isomura, T.; Fields, C.; Verbelen, T.; Ramstead, M.; Clippinger, J.; et al. Federated inference and belief sharing. Neurosci. Biobehav. Rev. 2024, 156, 105500. [Google Scholar] [CrossRef] [PubMed]
  61. Priorelli, M.; Stoianov, I. Dynamic Inference by Model Reduction. bioRxiv 2023. [Google Scholar] [CrossRef]
  62. Pio-Lopez, L.; Nizard, A.; Friston, K.; Pezzulo, G. Active inference and robot control: A case study. J. R. Soc. Interface 2016, 13, 20160616. [Google Scholar] [CrossRef] [PubMed]
  63. Priorelli, M.; Stoianov, I.P. Slow but flexible or fast but rigid? Discrete and continuous processes compared. Heliyon 2024, 10, e39129. [Google Scholar] [CrossRef]
  64. Adams, R.A.; Aponte, E.; Marshall, L.; Friston, K.J. Active inference and oculomotor pursuit: The dynamic causal modelling of eye movements. J. Neurosci. Methods 2015, 242, 1–14. [Google Scholar] [CrossRef]
  65. Priorelli, M.; Pezzulo, G.; Stoianov, I. Active Vision in Binocular Depth Estimation: A Top-Down Perspective. Biomimetics 2023, 8, 445. [Google Scholar] [CrossRef]
  66. Pezzato, C.; Buckley, C.; Verbelen, T. Why learn if you can infer? Robot arm control with Hierarchical Active Inference. In Proceedings of the The First Workshop on NeuroAI @ NeurIPS2024, Vancouver, BC, Canada, 14–15 December 2024. [Google Scholar]
  67. Priorelli, M.; Stoianov, I.P. Dynamic planning in hierarchical active inference. Neural Netw. 2025, 185, 107075. [Google Scholar] [CrossRef]
  68. Friston, K.; Da Costa, L.; Hafner, D.; Hesp, C.; Parr, T. Sophisticated inference. Neural Comput. 2021, 33, 713–763. [Google Scholar] [CrossRef]
  69. Ferraro, S.; de Maele, T.V.; Mazzaglia, P.; Verbelen, T.; Dhoedt, B. Disentangling Shape and Pose for Object-Centric Deep Active Inference Models. arXiv 2022, arXiv:2209.09097. [Google Scholar]
  70. van Bergen, R.S.; Lanillos, P.L. Object-based active inference. arXiv 2022, arXiv:2209.01258. [Google Scholar]
  71. Van de Maele, T.; Verbelen, T.; Çatal, O.; Dhoedt, B. Embodied Object Representation Learning and Recognition. Front. Neurorobotics 2022, 16, 840658. [Google Scholar] [CrossRef] [PubMed]
  72. Donnarumma, F.; Costantini, M.; Ambrosini, E.; Friston, K.; Pezzulo, G. Action perception as hypothesis testing. Cortex 2017, 89, 45–60. [Google Scholar] [CrossRef] [PubMed]
  73. Friston, K.J.; Mattout, J.; Kilner, J. Action understanding and active inference. Biol. Cybern. 2011, 104, 137–160. [Google Scholar] [CrossRef]
  74. Oliver, G.; Lanillos, P.; Cheng, G. An empirical study of active inference on a humanoid robot. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 462–471. [Google Scholar] [CrossRef]
  75. Isomura, T.; Parr, T.; Friston, K. Bayesian filtering with multiple internal models: Toward a theory of social intelligence. Neural Comput. 2019, 31, 2390–2431. [Google Scholar] [CrossRef]
  76. Nozari, S.; Krayani, A.; Marin-Plaza, P.; Marcenaro, L.; Gomez, D.M.; Regazzoni, C. Active Inference Integrated with Imitation Learning for Autonomous Driving. IEEE Access 2022, 10, 49738–49756. [Google Scholar] [CrossRef]
  77. Collis, P.; Singh, R.; Kinghorn, P.F.; Buckley, C.L. Learning in Hybrid Active Inference Models. arXiv 2024, arXiv:2409.01066. [Google Scholar]
  78. Anil Meera, A.; Lanillos, P. Towards Metacognitive Robot Decision Making for Tool Selection. In Proceedings of the Active Inference, Ghent, Belgium, 13–15 September 2023; Buckley, C.L., Cialfi, D., Lanillos, P., Ramstead, M., Sajid, N., Shimazaki, H., Verbelen, T., Wisse, M., Eds.; Springer: Cham, Switzerland, 2024; pp. 31–42. [Google Scholar]
  79. Priorelli, M.; Stoianov, I.P. Efficient Motor Learning Through Action-Perception Cycles in Deep Kinematic Inference. In Proceedings of the Active Inference; Springer Nature: Cham, Switzerland, 2024; pp. 59–70. [Google Scholar] [CrossRef]
  80. Ororbia, A.; Kifer, D. The neural coding framework for learning generative models. Nat. Commun. 2022, 13, 2064. [Google Scholar] [CrossRef]
  81. Whittington, J.C.R.; Bogacz, R. An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity. Neural Comput. 2017, 29, 1229–1262. [Google Scholar] [CrossRef]
  82. Whittington, J.C.; Bogacz, R. Theories of Error Back-Propagation in the Brain. Trends Cogn. Sci. 2019, 23, 235–250. [Google Scholar] [CrossRef] [PubMed]
  83. Millidge, B.; Tschantz, A.; Buckley, C.L. Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs. Neural Comput. 2022, 34, 1329–1368. [Google Scholar] [CrossRef] [PubMed]
  84. Salvatori, T.; Song, Y.; Lukasiewicz, T.; Bogacz, R.; Xu, Z. Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks. arXiv 2021, arXiv:2103.03725. [Google Scholar]
  85. Stoianov, I.; Maisto, D.; Pezzulo, G. The hippocampal formation as a hierarchical generative model supporting generative replay and continual learning. Prog. Neurobiol. 2022, 217, 102329. [Google Scholar] [CrossRef]
  86. Jiang, L.P.; Rao, R.P.N. Dynamic predictive coding: A model of hierarchical sequence learning and prediction in the neocortex. PLoS Comput. Biol. 2024, 20, e1011801. [Google Scholar] [CrossRef]
  87. Nguyen, T.; Shu, R.; Pham, T.; Bui, H.; Ermon, S. Temporal Predictive Coding For Model-Based Planning In Latent Space. arXiv 2021, arXiv:2106.07156. [Google Scholar]
  88. Tang, M.; Barron, H.; Bogacz, R. Sequential Memory with Temporal Predictive Coding. arXiv 2023, arXiv:2305.11982. [Google Scholar]
  89. Millidge, B. Combining Active Inference and Hierarchical Predictive Coding: A Tutorial Introduction and Case Study. PsyArXiv 2019. [Google Scholar]
  90. Ororbia, A.; Mali, A. Active Predicting Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems. arXiv 2022, arXiv:2209.09174. [Google Scholar]
  91. Rao, R.P.N.; Gklezakos, D.C.; Sathish, V. Active Predictive Coding: A Unified Neural Framework for Learning Hierarchical World Models for Perception and Planning. arXiv 2022, arXiv:2210.13461. [Google Scholar]
  92. Fisher, A.; Rao, R.P.N. Recursive neural programs: A differentiable framework for learning compositional part-whole hierarchies and image grammars. PNAS Nexus 2023, 2, pgad337. [Google Scholar] [CrossRef] [PubMed]
  93. Moerland, T.M.; Broekens, J.; Plaat, A.; Jonker, C.M. Model-based Reinforcement Learning: A Survey. arXiv 2022, arXiv:2006.16712. [Google Scholar]
  94. Albarracin, M.; Hipólito, I.; Tremblay, S.E.; Fox, J.G.; René, G.; Friston, K.; Ramstead, M.J.D. Designing explainable artificial intelligence with active inference: A framework for transparent introspection and decision-making. arXiv 2023, arXiv:2306.04025. [Google Scholar]
  95. de Maele, T.V.; Verbelen, T.; Mazzaglia, P.; Ferraro, S.; Dhoedt, B. Object-Centric Scene Representations using Active Inference. arXiv 2023, arXiv:2302.03288. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) Factor graph of a hybrid unit. Continuous hidden states x ˜ generate predictions y ˜ through parallel pathways. Model dynamics is encoded by potential trajectories f m , which are hypotheses of how the world may evolve and are associated with discrete hidden causes v . (b) Illustrative example of a hybrid unit. In this task, the agent has to infer which one among two objects (a red circle and a gray square moving along a circular trajectory) is being tracked by another 1-DoF agent. The time step is shown in the bottom left of each frame. The hidden states x ˜ encode the angle and angular velocity of the arm (generating proprioceptive predictions), as well as the positions and velocities of the two objects (generating visual predictions). The blue arrow represents the actual hand trajectory, while the red and green arrows represent the two potential trajectories associated with reaching movements toward the two objects. See [61] for more details.
Figure 2. (a) An IE module is composed of two units U_i and U_e, which represent a signal in intrinsic and extrinsic reference frames, respectively. Different IE modules can be combined in a hierarchical fashion; the extrinsic signal x_e^{(i)} is iteratively transformed through linear transformation matrices encoded in the extrinsic likelihood function g_e^{(i)}. Hierarchical levels communicate via the 0th-order hidden states. (b) Illustrative examples of a hierarchical model with IE modules. In the first task, the agent (a 23-DoF human body) has to avoid a moving obstacle; in the second task, the agent (a 28-DoF kinematic tree) has to reach four target locations with the extremities of its branches. In both cases, the module in (a) is repeated for every DoF of the agents, matching their kinematic structures. Proprioceptive and exteroceptive (e.g., visual) observations for each DoF are generated by the intrinsic and extrinsic units, respectively, via appropriate likelihood functions. See [36] for more details.
Figure 3. (a) Interface between a discrete model and several hybrid units. The hidden causes v ( i ) are directly generated, in parallel pathways, from discrete hidden states s τ via likelihood matrices A ( i ) . (b) Illustrative example with the hybrid units combined with a discrete model. In this task, the agent (a 4-DoF arm with an additional 4-DoF hand composed of two fingers) has to pick a moving ball (the red circle) and place it at a goal position (the grey square). The discrete hidden states s τ encode the agent position (start position, at the ball, or at the goal) and the status of the hand (open or closed). These are informed by two continuous models encoding intrinsic (joint angles) and extrinsic (hand and object positions) information, respectively. The hidden causes v of the intrinsic model are related to hand opening and closing actions, while the hidden causes of the extrinsic model relate to two reaching movements, as in the previous case. Note that the object belief (purple circle) is rapidly inferred, and as soon as the picking action is complete, the belief is gradually pulled toward the goal position, resulting in a second reaching movement. The top right panel shows the hand–object distance over time, while the bottom right panel displays the dynamics of the discrete action probabilities used to infer the next discrete state. The vertical dashed lines distinguish five different phases: a pure reaching movement, an intermediate phase when the agent prepares the grasping action, a grasping phase, a second reaching movement and, finally, the ball release. The stepped behavior of the action probabilities is due to the replanning made by the discrete model every 10 continuous time steps. See [63] for more details.
Figure 4. (a) Virtual environment of the tool use task. An agent controlling a 4-DoF arm has to grasp a moving tool (in green) and reach a moving ball (in red) with the tool’s extremity. (b) Agent’s beliefs over the continuous hidden states of the arm (blue), tool (light green), and ball (light red). The real positions of the tool and ball are represented in dark green and dark red, respectively. The virtual level is plotted with more transparent colors. (c) Graphical representation of the agent’s continuous generative model. Every environmental entity is encoded hierarchically by considering the whole arm’s kinematic structure. For clarity, the three pathways are displayed separately, while lateral connections and the high-level discrete model are not shown. The end effector’s level encodes intrinsic and extrinsic information about the end effector, regarding the three configurations (the actual end effector position, the belief over the end effector at the tool’s origin, or at an appropriate position to reach the ball with the tool’s extremity). Instead, the virtual level is not present in the actual configuration, since the tool is not part of the agent’s kinematic chain and it is only used in the generative model for goal-directed behavior, as if it were a new joint. This level encodes intrinsic and extrinsic information about the tool, regarding the two potential configurations (the belief over the tool’s extremity at the actual tool’s extremity, and at the actual ball position). Small purple and yellow circles represent proprioceptive and exteroceptive observations, respectively.
Figure 5. Graphical representation of a deep hybrid model for tool use, composed of a discrete model at the top and several IE modules. Every module is factorized into three elements, related to the observations of the agent’s arm (in blue), a tool (in green), and a ball (in red). Note that the last (virtual) level only considers the tool’s extremity and the ball. The computation of the action for a single time step is divided into four main processes. (a) Perception. Proprioceptive and visual observations y p and y e are compared with the agent’s predictions. The resulting prediction errors are propagated throughout the hierarchy to infer the actual kinematic configuration, as well as potential configurations related to the objects. (b) Dynamic inference. The bottom-up messages l e from the IE modules inform the discrete model about the most likely state that may have generated the perceived arm trajectory. This is achieved by comparing the latter with potential trajectories f m related to dynamic hypotheses v e (see Equation (57)). For instance, if the agent is reaching the tool and the ball is moving away, the bottom-up messages assign a higher probability to the tool-reaching hypothesis and a lower probability to the initial steady state. (c) Dynamic planning. The agent infers the next discrete action to take by minimizing the expected free energy G (see Equation (59)). As a result, the agent believes itself to be at the next discrete state, corresponding to the ball-reaching hypothesis. In turn, this biased state generates a new combined trajectory (through the discrete extrinsic prediction A e s in Equation (57)), acting as a prior for the continuous hidden states of the IE modules. (d) Action. The continuous hidden states generate predictions, which are again compared with the related observations. The proprioceptive prediction errors climb back up the hierarchy as before, but they are also suppressed through movement by motor units (see Equation (56)). This second process eventually produces a continuous action that moves the end effector toward the ball.
Figure 6. Representation of the dynamics active during the two steps. (a) The end effector (dark blue) is pulled toward the belief about the tool’s origin (light green) through f e , t ( 4 ) . (b) The tool’s extremity (dark green) is pulled toward the ball (dark red) through dynamics f e , b ( 5 ) . This generates an extrinsic prediction error ε e ( 5 ) that steers the previous level of the potential configuration of the tool. Concurrently, a second dynamics f e , b ( 4 ) also pulls both actual and tool components of the end effector’s level toward the potential configuration of the ball (light red).
Figure 7. Sequence of time frames of the simulation. Real ball, tool, and arm are displayed in dark red, dark green, and dark blue, respectively. Beliefs of tool and ball, in terms of potential kinematic configurations, are shown in light green and light red, respectively. Trajectories of the end effector, tool, and ball are displayed as well. The number of time steps is shown in the lower-left corner of each frame.
Figure 8. (a) Normalized log evidence, for 60 discrete steps τ (each composed, in turn, of 10 continuous time steps), of the end effector's level l_e^{(4)} (top) and virtual level l_e^{(5)} (middle), and the discrete hidden states (bottom). The green and red dashed lines, respectively, represent the tool's extremity–ball distance and the tool's origin–end effector distance, normalized to fit in the plots. As explained in Section 3.2.1, the agent has two discrete states and two hidden causes related to the steps of the task, i.e., reaching the ball and reaching the tool, with an additional stay discrete state and cause. (b) Dynamics of extrinsic hidden causes v_e (top plot). Norm of extrinsic potential dynamics ||f_{e,s}^{(4)}|| (stay), ||f_{e,t}^{(4)}|| (reach tool), and ||f_{e,b}^{(4)}|| (reach ball), along with estimated dynamics ||μ_e^{(4)}|| for the end effector's level (middle plot). Norm of extrinsic potential dynamics ||f_{e,s}^{(5)}|| (stay) and ||f_{e,b}^{(5)}|| (reach ball), along with estimated dynamics ||μ_e^{(5)}|| for the virtual level (bottom plot). The dynamics are plotted for 600 continuous time steps.
Figure 9. Performances of the deep hybrid model during tool use for three conditions: a moving tool (in red), a moving ball (in green), and moving tool and ball (in blue). Accuracy (top), for which we considered a trial successful if the tool was picked and the average ball–tool distance for the last 300 steps was less than 100 pixels. Time (middle), measured as the number of steps needed for the tool to be picked and for the ball–tool distance to be less than 100 pixels. Error (bottom), measured as the average ball–tool distance for the last 300 steps. For each condition, we aggregated the measures for 150 trials. The middle and bottom plots also show the 95% confidence interval.
Figure 10. Trajectory of every component of the extrinsic belief updates for the end effector (left panels) and virtual (right panels) levels. Every environmental entity is shown separately. Blue, orange, green, red, and purple lines, respectively, indicate the 1st-order derivative, the extrinsic prediction error from the previous level, the extrinsic prediction error from the next level, the visual prediction error, and the backward dynamics error.