Article

Directed Equilibrium Propagation Revisited

by Pedro Costa 1,*,† and Pedro A. Santos 1,2,*,†
1 Instituto Superior Técnico, University of Lisbon, 1049-001 Lisbon, Portugal
2 INESC-ID, 1000-029 Lisbon, Portugal
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2025, 13(11), 1866; https://doi.org/10.3390/math13111866
Submission received: 17 April 2025 / Revised: 18 May 2025 / Accepted: 23 May 2025 / Published: 3 June 2025
(This article belongs to the Special Issue Mathematics and Applications)

Abstract

Equilibrium Propagation (EP) offers a biologically inspired alternative to backpropagation for training recurrent neural networks, but its reliance on symmetric feedback connections and its stability limitations hinder practical adoption. The DirEcted EP (DEEP) model relaxes the symmetry constraint, yet suffers from convergence issues and lacks a principled learning guarantee. In this work, we generalize DEEP by incorporating neuronal leakage, providing new convergence criteria for the network's dynamics. We additionally propose a novel local learning rule closely linked to the objective function's gradient and establish sufficient conditions for reliable learning in small networks. Our results resolve longstanding stability challenges and bring energy-based learning models closer to biologically plausible and provably effective neural computation.

1. Introduction

Artificial neural networks are a type of artificial intelligence computing system modeled on the human brain and nervous system. The mathematical theory of neural networks explores how mathematical models can be used to understand and manipulate them. In this regard, statistical learning theory is a valuable tool for studying the behavior of neural networks [1]. However, this paper focuses on a specific learning algorithm that is a biologically plausible alternative to the well-known backpropagation algorithm. For this purpose, we will use some results from dynamical systems theory.
The basic building block of all artificial neural networks (ANNs) is the neuron. Mathematically, a neuron is a function $s: \mathbb{R}^n \to \mathbb{R}$, $n \in \mathbb{N}$, parameterized by a weight vector $w \in \mathbb{R}^{n+1}$ and defined by

$$s(x) = \rho\left(\sum_{i=1}^{n} w_i x_i + w_0\right), \qquad (1)$$
where $\rho$ is a suitable, usually continuous, real function known as the activation function. A feed-forward neural network is built by composing many (perhaps billions) of such simple functions, organized in layers. Networks with more than three layers are known in the literature as deep neural networks. For instance, AlphaZero, the ANN that was trained to play chess and Go with superhuman performance, consists of more than 80 layers, each with thousands of neurons [2,3]. For other applications, like natural language understanding or translation, recurrent networks are used. These networks are not necessarily organized in layers and thus need not form an acyclic graph.
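To make Equation (1) concrete, here is a minimal numerical sketch (our own illustration, not part of the original paper) of a single neuron and of a small feed-forward composition of layers; the sigmoid is one common choice for $\rho$:

```python
import numpy as np

def sigmoid(z):
    # A common choice of activation function rho
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, rho=sigmoid):
    # Equation (1): w[0] is the bias w_0, w[1:] are the weights w_1..w_n
    return rho(w[0] + np.dot(w[1:], x))

def layer(x, W, b, rho=sigmoid):
    # A feed-forward layer evaluates many neurons in parallel;
    # a network is a composition of such layers.
    return rho(W @ x + b)

rng = np.random.default_rng(0)
x = rng.random(4)                                        # input
h = layer(x, rng.standard_normal((8, 4)), np.zeros(8))   # hidden layer
y = layer(h, rng.standard_normal((2, 8)), np.zeros(2))   # output layer
```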
The goal of training such an artificial neural network is to find the correct parameters so that the function implemented by the ANN satisfies certain conditions or minimizes some objective function. For example, in supervised learning, a training set of input–output pairs is used to adjust the parameters until the error between the output of the ANN and the true values is minimized, using gradient descent in the parameter space. Because the error is propagated from the output layer backwards, the algorithm used is known as backpropagation (see, for instance, [4], Chapter 5).
The backpropagation approach to supervised learning stands out as the most successful algorithm for training artificial neural networks. But despite ANNs being originally bio-inspired, backpropagation is widely considered to be biologically implausible [5]. Among other reasons, it (i) lacks local error representation, (ii) uses distinct forward and backward information passes, and (iii) requires symmetric feedback weights. One path towards bridging the gap between biology and machine learning is thus to explore other learning paradigms, closer to biological reality.
Equilibrium propagation (EP) is an energy-based model for recurrent neural networks which operates by minimizing an energy function, uses a local learning rule, and employs just one kind of neural computation, thereby satisfying restrictions (i) and (ii) [5]. This model considers continuous-time neural dynamics, and so is a departure from backpropagation, which makes discrete per-layer updates. However, it also requires symmetric feedback weights to work.
The original idea was followed by several advances addressing the challenges of convergence, scalability, and hardware efficiency while reinforcing theoretical ties to traditional backpropagation.
Laborieux et al. [6] extended EP to deep convolutional networks using symmetric nudging; their work on CIFAR-10 reports an 11.7% test error and lower memory use, though convergence is about 20% slower than backpropagation through time. Martin et al. [7] introduced the EqSpike algorithm for spiking networks and achieved 97.6% accuracy on MNIST. Foroushani et al. [8] implemented EP on an analog circuit and reported a 250-fold acceleration in the relaxation process relative to a Python baseline. Kiraz et al. [9] focused on parameter optimization by analyzing the effects of feedback current and learning rate, while Laborieux and Zenke [10] proposed holomorphic EP, which computes exact gradients using finite teaching signals and matches backpropagation performance on an ImageNet 32 × 32 benchmark.
The problem of symmetric feedback weights was tackled with the introduction of the DirEcted EP (DEEP) model, which allows for asymmetric feedback connections while abandoning the need for a global energy function [11]. The issue in doing so is that the convergence of the neuronal dynamics is no longer assured, because there is no energy function guiding the network’s behavior. In addition, the weight update rule, while biologically inspired, does not seem to have any ties to the gradient of the objective function, and learning is thus, in general, not guaranteed.
The main contribution of the present work is a generalization of the DEEP model by adding leakage to non-input neurons, which solves the stability issues found in the earlier work. New conditions for the convergence of the neuronal dynamics of DEEP’s inference phase are established, and a different local weight update rule is proposed, with close ties to the gradient of the objective function. Moreover, sufficient conditions for a small-sized network to learn following the proposed weight update rule are also determined.
The remainder of the paper is organized as follows. In Section 2, common notation and the EP and DEEP models are introduced. Section 3 goes over the stability problems in the DEEP model. In Section 4, our generalization of the DEEP model is presented along with new conditions for convergence of the inference and learning phases. Section 5 discusses a new learning rule and its ties to the gradient of the objective function. Section 6 concludes the paper.

2. Equilibrium Propagation and Directed Equilibrium Propagation

In this section, we begin by defining some notation to be used throughout this text. Then, we present both the original EP model and the DEEP model.

2.1. Notation

In Equation (1), $w_0$ is commonly referred to as the neuron's bias. In this text, the bias is represented as a fixed neuron (neuron 0) that maintains a state of one and connects to every non-input neuron. Then, the weight associated with the connection from this neuron to a neuron $i$ becomes neuron $i$'s bias.
For an architecture consisting of a total of $N + 1$ neurons, where $P$ are input neurons, $L$ are hidden neurons, and $K$ are output neurons, both EP and DEEP are entirely described by the following elements:
  • A state vector $s(t) = [s_j(t)]_{j=0}^{N} \in [0,1]^{N+1}$, representing the neuronal activities. This state vector is composed of sub-vectors corresponding to the (fixed) input $x = [s_j(t)]_{j=1}^{P}$, the hidden neurons $h(t)$, and the output neurons $\hat{y}(t) = [s_j(t)]_{j=N-K+1}^{N}$, where $s_0(t) \equiv 1$ corresponds to the bias neuron. The value $s_j$ can be biologically interpreted as the firing rate of neuron $j$;
  • A weight matrix $W = [W_{ij}]_{i,j=0}^{N}$, where $W_{ij}$ represents the weight of the connection from neuron $i$ to neuron $j$;
  • A system of continuous-time differential equations that governs the dynamics of the network.
From this point onward, the time dependency of the state variable will be omitted to simplify the notation.
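For concreteness, the bookkeeping just described can be laid out as follows (a small sketch of our own, with hypothetical sizes P = 2, L = 1, K = 1):

```python
import numpy as np

P, L, K = 2, 1, 1        # input, hidden, and output neurons
N = P + L + K            # neurons are indexed 0..N; index 0 is the bias

s = np.zeros(N + 1)
s[0] = 1.0               # bias neuron, fixed at 1
s[1:P+1] = [0.3, 0.7]    # clamped input x
# s[P+1:N-K+1] holds the hidden neurons h(t); s[N-K+1:] the outputs y_hat(t)

W = np.zeros((N + 1, N + 1))   # W[i, j]: weight of the connection i -> j
```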

2.2. Equilibrium Propagation

Originally introduced in [5], EP is a learning framework for energy-based models that employs a network of recurrently connected neurons with symmetric weights. Its goal is to provide a more biologically plausible alternative to conventional ANNs by circumventing the backpropagation algorithm.
This model is driven by an energy function F, which is the central object of its behavior: all quantities of interest (fixed points, cost function, objective function, gradient formula) can be defined or formulated directly in terms of F.

2.2.1. The Network

The quantity $W_{ij} = W_{ji}$ is the weight associated with the connection between neurons $i$ and $j$. Self-loops are excluded, meaning $W_{ii} = 0$ for all $i$, and input neurons do not form connections among themselves, i.e., $W_{ij} = 0$ for all $i, j \in \{1, \dots, P\}$.

In Figure 1, the architecture of a three-neuron network of the EP model is represented. The state vector is $s = [s_i]_{i=0}^{3}$, containing four elements: a constant bias term $[s_0] = [1]$, an input term $x = [s_1]$, a hidden term $h = [s_2]$, and an output term $\hat{y}(t) = [s_3]$. Each $s_j$ corresponds to the activity of the $j$th neuron. The synaptic connections are described by the weight matrix $W = [W_{ij}]_{i,j=0}^{3}$ (note that $W_{01}$ is omitted from Figure 1 since neuron 1 is an input neuron and so $W_{01} = 0$).

2.2.2. Auxiliary Functions

There are some functions that need to be defined in order to establish the dynamics of the network. The internal energy function, $E$, depends only on the state of the network and is defined as

$$E(s(t)) = \frac{1}{2} \sum_{i=0}^{N} s_i^2(t) - \frac{1}{2} \sum_{i=0}^{N} \sum_{j=0}^{N} W_{ij}\, s_i(t)\, s_j(t). \qquad (2)$$

It should be noted that the notation in this equation differs slightly from the one presented in [5]. In particular, the notation was simplified due to the alternative version of stochastic gradient descent employed (detailed in Equation (8)), under which $s_j = \rho(s_j)$ holds for all $j$.

The external energy function $C$, which corresponds to the cost function, is the mean squared error and is defined as

$$C(s, y) = \frac{1}{2} \sum_{j=P+L+1}^{N} \left(s_j(t) - y_j\right)^2, \qquad (3)$$

where $y$ is the vector containing the target output for the training input. The clamping factor, denoted by $\beta \in \mathbb{R}_0^+$, is the parameter that determines the extent to which the target output $y$ influences the network. Incorporating this "influence parameter", the total energy function $F$ is then defined as

$$F(s(t), y, \beta) = E(s(t)) + \beta\, C(s(t), y). \qquad (4)$$

Finally, the loss function $J$ is defined as the external energy at the equilibrium state $s^\ast = s(T_0)$, which is achieved after adjusting the neurons over the time interval $[0, T_0]$ to minimize the internal energy function.
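The three energy functions translate directly into code; the following sketch (ours, assuming the simplified setting in which $s_j = \rho(s_j)$) mirrors Equations (2)–(4):

```python
import numpy as np

def internal_energy(s, W):
    # E(s) = 1/2 sum_i s_i^2 - 1/2 sum_{i,j} W_ij s_i s_j   (Equation (2))
    return 0.5 * np.dot(s, s) - 0.5 * s @ W @ s

def cost(s, y, K):
    # C(s, y): squared error on the K output neurons        (Equation (3))
    return 0.5 * np.sum((s[-K:] - y) ** 2)

def total_energy(s, W, y, K, beta):
    # F = E + beta * C                                      (Equation (4))
    return internal_energy(s, W) + beta * cost(s, y, K)
```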

2.2.3. The Dynamics and Training Algorithm

The EP dynamics are defined by a set of differential equations. The activity of each adjustable neuron, $s_j$, $j \in \{P+1, \dots, N\}$, evolves in the negative direction of the gradient of the total energy,

$$\frac{ds_j}{dt} = -\frac{\partial F(s(t), y, \beta)}{\partial s_j} = -\frac{\partial E(s(t))}{\partial s_j} - \beta\, \frac{\partial C(s(t), y)}{\partial s_j}, \qquad (5)$$

where

$$\frac{\partial F(s(t), y, \beta)}{\partial s_j} = s_j - \sum_{i=0}^{N} W_{ij}\, s_i + \beta\, (s_j - y_j)\, \mathbb{1}_{\hat{y}}(s_j). \qquad (6)$$

Recall that the indicator function is defined as

$$\mathbb{1}_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A, \end{cases}$$

and $s_j \in \hat{y} \iff j \in \{P+L+1, \dots, N\} \iff j$ is an output neuron. As a result, the total energy of the system is guaranteed to decrease over time. To promote convergence and keep the firing rates confined within the interval $[0,1]$, a different form of gradient descent is used in the neuronal update rule. Take $dt$ as the smallest time interval considered between iterations. At iteration step $m + 1$, the standard gradient descent update rule takes the form

$$s_j(m+1) \leftarrow s_j(m) + dt \left( \sum_{i=0}^{N} W_{ij}\, s_i(m) - s_j(m) - \beta\, \big(s_j(m) - y_j\big)\, \mathbb{1}_{\hat{y}}(s_j) \right). \qquad (7)$$

However, in what follows, the update rule is modified to

$$s_j(m+1) \leftarrow \rho\left( s_j(m) + dt \left( \sum_{i=0}^{N} W_{ij}\, s_i(m) - s_j(m) - \beta\, \big(s_j(m) - y_j\big)\, \mathbb{1}_{\hat{y}}(s_j) \right) \right). \qquad (8)$$
The training algorithm has two distinct phases, corresponding to setting β = 0 (first phase) and β > 0 (second phase).
The first phase, which corresponds to the inference process, consists of allowing the network, given a fixed input expressed by the values taken by the input neurons, to evolve toward an equilibrium state with respect to Equation (5) with $\beta = 0$. In the second phase, commonly referred to as the learning phase, a small increment in $\beta$ introduces a perturbation to the system. This slight adjustment causes the output neurons to shift their activities closer to the target values, which, in turn, propagates through the network. The system then relaxes into a new equilibrium state with respect to (5) that reduces the loss function.
Let us denote the first and second equilibrium states by $s^0$ and $s^\beta$. The weight updates are then computed based on the gradient of the loss function, which, in the limit $\beta \to 0$, depends solely on $s^0$ and $s^\beta$:

$$\Delta W_{ij} \propto \frac{1}{\beta}\left( s_i^\beta s_j^\beta - s_i^0 s_j^0 \right). \qquad (9)$$

This formula for the gradient of the loss with respect to $W_{ij}$ was proved in [5].
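Putting the pieces together, one EP training step can be sketched as follows (our minimal illustration; a real implementation would also keep W symmetric with zero diagonal and mask the forbidden input–input connections):

```python
import numpy as np

def rho(s):
    # hard-sigmoid activation (Equation (18))
    return np.clip(s, 0.0, 1.0)

def relax(s, W, y, beta, P, K, dt=0.1, steps=500):
    # Iterate the update rule of Equation (8); only non-input neurons move.
    s = s.copy()
    out = np.zeros_like(s); out[-K:] = 1.0       # indicator of output neurons
    target = np.zeros_like(s); target[-K:] = y
    for _ in range(steps):
        ds = W.T @ s - s - beta * (s - target) * out
        s[P+1:] = rho(s[P+1:] + dt * ds[P+1:])
    return s

def ep_training_step(s, W, y, P, K, beta=0.1, lr=0.05):
    s0 = relax(s, W, y, 0.0, P, K)               # first phase  (inference)
    sb = relax(s0, W, y, beta, P, K)             # second phase (learning)
    W += lr * (np.outer(sb, sb) - np.outer(s0, s0)) / beta   # Equation (9)
    return s0, W
```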

2.3. DirEcted Equilibrium Propagation (DEEP)

DEEP was proposed in [11] and consists of a generalization of the original EP model. Specifically, DEEP "seeks to improve the biological plausibility of the original EP model, by removing all structural restrictions and enabling the architecture of the model to be a complete directed (asymmetric connections) graph". Figure 2 shows an example of such a neural network.
Due to the lack of symmetry of the synaptic weights, DEEP is not energy-based: its neuronal dynamics are governed by vector fields that are not gradient fields.
Fixing (clamping) the input neurons' states, the proposed neuronal dynamics are

$$\frac{ds_j}{dt} = V_j(s, y, \beta) = \sum_{i=0}^{N} W_{ij}\, s_i - s_j \sum_{i=0}^{N} W_{ji} - \beta\, \big(s_j(t) - y_j(t)\big)\, \mathbb{1}_{\hat{y}}(s_j), \qquad (10)$$
where $y$ is the vector of target/desired outputs and the cost function $C$ is identical to the one specified in (3) for the original formulation. The training algorithm also consists of two distinct phases: the first phase, where $\beta = 0$, and the second phase, where $\beta > 0$. During both phases, the network reaches equilibrium states, denoted by $s^0$ in the first phase and $s^\beta$ in the second. The objective function is defined as the cost when the network is at its first equilibrium state, $J(x, y) := C(s^0, y)$. In the inference phase, the activities of the input neurons are fixed, and the network evolves to the equilibrium state $s^0$, from which the output is read at the corresponding output neurons $\hat{y}$.
The learning rule used is

$$\Delta W_{ij} \propto \frac{1}{M_\beta} \sum_{m=M_0+1}^{M_0+M_\beta} s_i(m)\, \big(s_j(m) - s_j(m-1)\big), \qquad (11)$$

where $M_0$ and $M_\beta$ are the numbers of steps in a discretization of the first and second phases, respectively, and $s_i(m)$ is the $i$th neuron's state after $m$ steps.
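A minimal sketch of DEEP's discretized dynamics and of the learning rule (11) might look as follows (our illustration; deep_step clips activities to [0, 1] with the hard-sigmoid, as in [11]):

```python
import numpy as np

def deep_step(s, W, y, beta, P, K, dt=0.1):
    # One discretized step of Equation (10). W is asymmetric, so both the
    # incoming term (W^T s)_j and the outgoing term s_j * sum_i W_ji appear.
    out = np.zeros_like(s); out[-K:] = 1.0
    target = np.zeros_like(s); target[-K:] = y
    ds = W.T @ s - s * W.sum(axis=1) - beta * (s - target) * out
    s = s.copy()
    s[P+1:] = np.clip(s[P+1:] + dt * ds[P+1:], 0.0, 1.0)
    return s

def deep_weight_update(states, M0, lr=0.05):
    # Equation (11): states[m] is the network state after m steps; the last
    # M_beta entries belong to the second phase.
    M_beta = len(states) - 1 - M0
    dW = sum(np.outer(states[m], states[m] - states[m - 1])
             for m in range(M0 + 1, M0 + M_beta + 1))
    return lr * dW / M_beta
```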

General Stability Test

Consider a nonlinear, time-invariant dynamical system with state variable $s \in \mathbb{R}^N$,

$$\frac{ds(t)}{dt} = f(s(t)), \qquad (12)$$

where $f: \mathbb{R}^N \to \mathbb{R}^N$ is a nonlinear continuously differentiable function (a $C^1$ map).

Definition 1 (Equilibrium state). An equilibrium state $s^\ast$ of the system represented by Equation (12) is one for which $f(s^\ast) = 0$.

Definition 2 (Stable in the sense of Lyapunov). A system is said to be stable in the sense of Lyapunov around $s^\ast$ if, for any $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that if $\|s(t_0) - s^\ast\| < \delta(\epsilon)$, then $\|s(t) - s^\ast\| < \epsilon$ for all $t > t_0$.

Definition 3 (Locally asymptotically stable). A system is said to be locally asymptotically stable around $s^\ast$ if it is stable in the sense of Lyapunov and there exists $\gamma(t_0) > 0$ such that, if $\|s(t_0) - s^\ast\| < \gamma(t_0)$, then $\lim_{t \to \infty} s(t) = s^\ast$.
The following result, which concerns the stability of nonlinear dynamical systems, is stated and proved in [11]:
Theorem 1 (Farinha). Let $s^\ast$ be an equilibrium state of Equation (12), $J \in \mathbb{R}^{N \times N}$ the Jacobian matrix of $f$ evaluated at $s^\ast$, and $R_j = \sum_{i=1, i \neq j}^{N} |J_{ji}|$, $j \in \{1, \dots, N\}$. If, for every $j \in \{1, \dots, N\}$, the following two conditions are satisfied:
1. $J_{jj} < 0$;
2. $R_j < |J_{jj}|$;
then $s^\ast$ is locally asymptotically stable.

It yields, as a corollary, sufficient conditions for the stability of the network's dynamics during an inference phase:

Corollary 1 (DEEP Inference Sufficient Conditions for Stability Verification). Let $s^0$ be an equilibrium state with respect to the dynamics given by Equation (10). If, for every $j \in \{P+1, \dots, N\}$, the following two conditions are satisfied:
1. $\sum_{i=0}^{N} W_{ji} > 0$;
2. $\sum_{i=P+1}^{N} |W_{ij}| < \left| \sum_{i=0}^{N} W_{ji} \right|$;
then $s^0$ is locally asymptotically stable.
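The two conditions of Corollary 1 are mechanical to check for a given weight matrix; a small verification helper (our sketch) could read:

```python
import numpy as np

def corollary1_holds(W, P):
    # Condition 1: sum_i W_ji > 0; Condition 2: sum_{i>P} |W_ij| < |sum_i W_ji|
    N = W.shape[0] - 1
    for j in range(P + 1, N + 1):
        row_sum = W[j, :].sum()               # sum_{i=0}^{N} W_ji
        col_abs = np.abs(W[P+1:, j]).sum()    # sum_{i=P+1}^{N} |W_ij|
        if not (row_sum > 0 and col_abs < abs(row_sum)):
            return False
    return True
```

As Section 3 shows, without leakage these conditions can never hold simultaneously; with the leakage introduced in Section 4, they become satisfiable.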

3. Stability Issues of the DEEP Model

In this section, two problems regarding the stability of the inference and learning phases of the DEEP model [11] are presented. The first one concerns the mere existence of stable states with regard to Equation (10) when β = 0 , while the second one concerns Corollary 1.

3.1. Nonexistence of Stable States

During an inference phase, $\beta = 0$, so the network's dynamics are dictated by

$$V_j(s(t)) = \sum_{i=0}^{N} W_{ij}\, s_i(t) - s_j(t) \sum_{i=0}^{N} W_{ji}. \qquad (13)$$

In [11], it is noted that a property of DEEP is that, in the absence of external stimuli (when the input and bias neurons have fixed activities), the total sum of firing rates remains constant over time (i.e., $\sum_{j=1}^{N} \dot{s}_j(t) = 0$, $\forall t \in \mathbb{R}_0^+$). Despite this, in an inference phase, the input neurons are clamped (have fixed firing rates), and so the sum of the time derivatives of the firing rates is

$$\sum_{j=0}^{N} \dot{s}_j = \sum_{j=P+1}^{N} \dot{s}_j = \sum_{j=P+1}^{N} \left( \sum_{i=0}^{N} W_{ij}\, s_i - s_j \sum_{i=0}^{N} W_{ji} \right) \qquad (14)$$

$$= \sum_{j=P+1}^{N} \sum_{i=0}^{N} W_{ij}\, s_i - \sum_{j=P+1}^{N} \sum_{i=0}^{N} W_{ji}\, s_j \qquad (15)$$

$$= \left( \sum_{j=P+1}^{N} \sum_{i=P+1}^{N} W_{ij}\, s_i - \sum_{j=P+1}^{N} \sum_{i=P+1}^{N} W_{ji}\, s_j \right) + \sum_{j=P+1}^{N} \sum_{i=0}^{P} W_{ij}\, s_i - \sum_{j=P+1}^{N} \sum_{i=0}^{P} W_{ji}\, s_j. \qquad (16)$$

The two terms inside the parentheses cancel each other out and, since $W_{ji} = 0$ $\forall i \leq P$, the last term is 0. We are left with

$$\sum_{j=0}^{N} \dot{s}_j = \sum_{i=0}^{P} \sum_{j=P+1}^{N} W_{ij}\, s_i = \text{constant}. \qquad (17)$$
As a consequence, without an activation function, the neuronal dynamics in (10) do not, in general, have any equilibrium state (its existence requires that $\sum_{j=0}^{N} \dot{s}_j$ can equal 0).
In [11], this was avoided with the use of the activation function $\rho: \mathbb{R} \to [0,1]$, defined as

$$\rho(s) = \begin{cases} 0, & s < 0 \\ s, & 0 \leq s \leq 1 \\ 1, & s > 1. \end{cases} \qquad (18)$$

This function, which is continuous and differentiable almost everywhere, is known as the hard sigmoid and constrains the neuron's activity to lie within the interval $[0,1]$. Nevertheless, the neurons in the DEEP model would saturate (i.e., have a high tendency to only stabilize with an activity of either 0 or 1). Following the tradition in the neural networks literature, we will represent by the same symbol the map obtained when applying $\rho$ to each entry of a vector.
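The constant-sum property above is easy to confirm numerically; in the sketch below (our illustration), the sum of the derivatives of the free neurons is the same no matter how those neurons' activities are re-randomized, because it only depends on the clamped inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 2, 5                       # neuron 0 is the bias, 1..2 are inputs
W = rng.standard_normal((N + 1, N + 1))
W[:, :P+1] = 0.0                  # no connections ending in bias/input neurons
np.fill_diagonal(W, 0.0)          # no self-loops

def sum_of_derivatives(s):
    ds = W.T @ s - s * W.sum(axis=1)   # V_j of Equation (13)
    return ds[P+1:].sum()

s = np.concatenate(([1.0], rng.random(P), rng.random(N - P)))
for _ in range(3):
    s[P+1:] = rng.random(N - P)        # re-randomize the free neurons
    print(sum_of_derivatives(s))       # prints the same (nonzero) constant
```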

3.2. On the Assumptions of Corollary 1

In [11], Theorem 1 is proved by combining two major results: Lyapunov's indirect method and the Gershgorin Circle Theorem, which states that every eigenvalue of a matrix $A$ lies within at least one of the Gershgorin circles $D_i = \{z \in \mathbb{C} : |z - A_{ii}| \leq R_i\}$, where $R_i = \sum_{j=1, j \neq i}^{N} |A_{ij}|$ [12]. The former guarantees that, after linearizing the system around an equilibrium state $s^\ast$, in order to establish the local asymptotic stability of $s^\ast$, one must verify that all eigenvalues of the Jacobian matrix $J$ of $V$ (evaluated at $s^\ast$) have strictly negative real parts. The latter provides sufficient conditions for the real parts of every eigenvalue of $J$ to be strictly negative.

The matrix $J$ is

$$J = \begin{pmatrix} -\sum_{i=0}^{N} W_{P+1,i} & W_{P+2,P+1} & \cdots & W_{N,P+1} \\ W_{P+1,P+2} & -\sum_{i=0}^{N} W_{P+2,i} & \cdots & W_{N,P+2} \\ \vdots & \vdots & \ddots & \vdots \\ W_{P+1,N} & \cdots & W_{N-1,N} & -\sum_{i=0}^{N} W_{N,i} \end{pmatrix}. \qquad (19)$$

Since $W_{ji} = 0$ $\forall i \leq P$, the sum of the entries of each column of $J$ is 0. As such, the vector $v = (1, 1, \dots, 1)$ satisfies $J^T \cdot v = 0 \cdot v$, meaning $\lambda = 0$ is an eigenvalue of $J^T$, and thus of $J$, regardless of $W$.

This means that no conditions obtained from the Gershgorin Circle Theorem can guarantee that all eigenvalues of $J$ have strictly negative real parts (because $\lambda = 0$ fails that requirement). So, the following result holds:
Lemma 1.
It is impossible to simultaneously satisfy both conditions stated in Corollary 1.
Proof. Considering the first condition and the fact that $W_{ji} = 0$ $\forall i \leq P$, the second condition yields

$$\sum_{i=P+1}^{N} |W_{ij}| < \left| \sum_{i=0}^{N} W_{ji} \right| = \sum_{i=0}^{N} W_{ji} = \sum_{i=P+1}^{N} W_{ji}.$$

Summing this inequality with $j$ varying from $P + 1$ to $N$, we obtain

$$\sum_{j=P+1}^{N} \sum_{i=P+1}^{N} |W_{ij}| < \sum_{j=P+1}^{N} \sum_{i=P+1}^{N} W_{ji} \iff \sum_{j=P+1}^{N} \sum_{i=P+1}^{N} |W_{ji}| < \sum_{j=P+1}^{N} \sum_{i=P+1}^{N} W_{ji},$$

which is a contradiction. □
In this section, we showed that the neuronal dynamics in the DEEP model, without an activation function, have no equilibrium state. Furthermore, Lemma 1 shows that Corollary 1 (from [11]) is a vacuously true statement: it claims that, given Conditions 1 and 2, $s^0$ is locally asymptotically stable, yet those conditions are impossible to meet simultaneously.
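The structural zero eigenvalue of $J$ can also be checked numerically (our sketch; the Jacobian is built directly from Equation (19)):

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 2, 6
W = rng.standard_normal((N + 1, N + 1))
W[:, :P+1] = 0.0                  # no leakage: connections into inputs are 0
np.fill_diagonal(W, 0.0)

free = np.arange(P + 1, N + 1)    # non-input neurons
J = W[np.ix_(free, free)].T       # off-diagonal entries J_kj = W_jk
np.fill_diagonal(J, -W[free, :].sum(axis=1))

eigs = np.linalg.eigvals(J)
print(np.min(np.abs(eigs)))       # ~1e-15: lambda = 0, for any such W
```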

4. Generalizing the DEEP Model

One way to improve the DEEP model and overcome the problem described in the previous section is to add a leakage term to the neurons. There are several ways of representing this in the model: one would be to introduce a new leakage neuron $N + 1$ and let $W_{i(N+1)}$ be the $i$th neuron's leakage term; another, which simplifies the mathematical treatment of the model, is to allow the connections ending in input neurons to be different from 0.

Since these neurons are clamped in both phases of training, they are unaffected by this change. Each of the remaining neurons gains a leakage term $\mathrm{leak}(i) := \sum_{j=0}^{P} W_{ij}$ (which was previously equal to 0).

The dynamics of the network remain the same as in Equation (10), except that now the first $P + 1$ terms of the sum $\sum_{i=0}^{N} W_{ji}$ are not necessarily 0. We note that, by adding leakage to the model:
  • Both problems mentioned in the previous section are solved: the sum of the derivatives of the neurons' activations during an inference phase is no longer constant, and the conditions provided by Corollary 1 are no longer impossible to hold. In particular, since the proof in [11] nowhere used that $W_{ij} = 0$, $j \leq P$, Corollary 1 still holds.
  • The input neurons stop being different from any other neurons: on a structural level, what distinguished them was having no incoming connections.
We now present results attained for the stability of the altered DEEP model, which incorporates the newly added leakage; these results may shed light on this model's need for leakage. We then show that these new stability conditions, as well as those of Corollary 1, are also enough to ensure stability in the subsequent learning phase ($\beta > 0$). Finally, an interesting link between Corollary 1 and the new conditions is described.

4.1. New Stability Conditions

Let us consider the discretized system, with $dt > 0$ as the smallest time interval considered, and let $s(n)$ be the system's state at the instant $n \cdot dt$. Then, following the same version of stochastic gradient descent considered in Equation (8), now in vector form, we have $s(n+1) = \rho\big(s(n) + dt \cdot V(s(n))\big)$. This means we can view an inference phase as a fixed-point iteration of the function $\Phi$, defined as

$$\Phi: \mathbb{R}^{N+1} \to \mathbb{R}^{N+1}, \quad s(n) \mapsto s(n+1) = \rho\big(s(n) + dt \cdot V(s(n))\big). \qquad (20)$$

An equilibrium state with respect to the network's dynamics simply becomes a fixed point of $\Phi$. We assume $\rho$ is continuously differentiable and $\rho(\mathbb{R}) \subseteq [0,1]$ (see the end of this subsection). We will need the following theorems (see, for instance, [13,14]):
Theorem 2. Let $X$ be a convex set in $\mathbb{R}^n$ and $\Phi: X \subseteq \mathbb{R}^n \to \mathbb{R}^n$ a $C^1$ function in $\mathbb{R}^n$. Letting $J_\Phi$ be the Jacobian of $\Phi$ (the matrix norm $\|\cdot\|_1$ of a matrix $A$ is defined as $\|A\|_1 = \max_{j=1,\dots,n} \sum_{i=1}^{n} |a_{ij}|$), if

$$\sup_{x \in X} \|J_\Phi(x)\|_1 < 1, \qquad (21)$$

then the function $\Phi$ is contractive in $X$.

Theorem 3 (Fixed-point theorem in $\mathbb{R}^n$). Let $E$ be a finite-dimensional normed space and $X$ a closed convex subset of $E$. Let $\Phi$ be a contractive function in $X$ such that

$$\Phi(X) \subseteq X. \qquad (22)$$

Then, the following statements are valid:
(1) $\Phi$ has a unique fixed point $z$ in $X$.
(2) If $(x^{(k)})_{k \geq 0}$ is the sequence of terms in $E$ such that $x^{(0)} \in X$ and

$$x^{(k+1)} = \Phi\big(x^{(k)}\big), \quad k \geq 0, \qquad (23)$$

then $(x^{(k)})_{k \geq 0}$ converges to $z$.

We take $E = \mathbb{R}^{L+K}$ and $X = [0,1]^{L+K}$, which clearly satisfy the conditions of Theorem 3. Considering that the activation function $\rho$ restricts the neurons' activations to the interval $[0,1]$, $\Phi(X) \subseteq X$ is immediate. All we have to verify is that $\Phi$ is contractive in $X$. We are now ready to prove the following result:
Theorem 4 (DEEP Sufficient Conditions for Stability of an Inference Phase). The inference phase of the DEEP model converges to an equilibrium state if the activation function $\rho$ used is such that $|\rho'(x)| \leq 1$, $\forall x \in \mathbb{R}$, and, for $j \in \{P+1, \dots, N\}$,

$$\mathrm{leak}(j) > 2 \cdot \sum_{\substack{i = P+1 \\ W_{ji} < 0}}^{N} |W_{ji}|, \qquad (24)$$

where $\mathrm{leak}(j) = \sum_{i=0}^{P} W_{ji}$.
Proof. For any suitable function $F$, let $J_F$ denote its Jacobian matrix. By Theorem 2,

$$\sup_{x \in X} \|J_\Phi(x)\|_1 < 1 \qquad (25)$$

suffices to guarantee convergence of DEEP's inference phase. Let $x \in X$. Since $\Phi = \rho \circ (I + dt\,V)$, where $I$ is the identity operator, we have

$$\|J_\Phi(x)\|_1 = \left\| J_\rho\big((I + dt\,V)(x)\big) \cdot J_{(I + dt\,V)}(x) \right\|_1 \leq \left\| J_\rho\big((I + dt\,V)(x)\big) \right\|_1 \cdot \left\| J_{(I + dt\,V)}(x) \right\|_1. \qquad (26)$$

Since $\partial \rho_i / \partial x_j = 0$ if $i \neq j$ and $|\rho'(x)| \leq 1$, $J_\rho$ is a diagonal matrix and $\|J_\rho(x)\|_1 \leq 1$.

Due to the linearity of the vector field $V$ defined in (13), $J_{(I + dt\,V)}$ is independent of $x$. This means $\|J_{(I + dt\,V)}(x)\|_1 < 1$ is enough to ensure (25), since we are taking the supremum over the product of $\|J_\rho\big((I + dt\,V)(x)\big)\|_1$, which is not greater than 1, and $\|J_{(I + dt\,V)}(x)\|_1$, a constant real number strictly less than 1.

We perform the calculations for the $j$th column of $J_{(I + dt\,V)} = I + dt\,J_V$ ($J_V$ is matrix (19)):

$$\left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} \right| + \sum_{i=P+1}^{N} \left| dt \cdot W_{P+j,i} \right| < 1. \qquad (27)$$

For sufficiently small $dt$, $1 - dt \sum_{i=0}^{N} W_{P+j,i} > 0$, and we can remove the absolute value in the first term. We are left with

$$1 - dt \left( \sum_{i=0}^{N} W_{P+j,i} - \sum_{i=P+1}^{N} |W_{P+j,i}| \right) < 1 \qquad (28)$$

$$\iff \sum_{i=0}^{N} W_{P+j,i} - \sum_{i=P+1}^{N} |W_{P+j,i}| > 0 \qquad (29)$$

$$\iff \sum_{i=0}^{P} W_{P+j,i} > 2 \sum_{\substack{i = P+1 \\ W_{P+j,i} < 0}}^{N} |W_{P+j,i}|, \qquad (30)$$

where, in the last step, we cancelled the weights that appear in both sums. The left-hand term is familiar, and the condition can be written as

$$\mathrm{leak}(P+j) > 2 \cdot \sum_{\substack{i = P+1 \\ W_{P+j,i} < 0}}^{N} |W_{P+j,i}| \qquad (31)$$

or, in general, for $j \in \{1, \dots, N-P\}$,

$$\mathrm{leak}(P+j) > 2 \cdot \sum_{\substack{i = P+1 \\ W_{P+j,i} < 0}}^{N} |W_{P+j,i}|. \qquad (32)$$ □
This result can be interpreted as the DEEP model needing leakage to compensate for the existence of weights with negative values.
The sigmoid function and the hyperbolic tangent are examples of commonly used activation functions that satisfy the hypotheses of Theorem 4. Note that the activation function introduced in (18) is not $C^1$; however, it is $C^1$ almost everywhere, and so arbitrarily good approximations by differentiable functions can be used.
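Theorem 4's condition is likewise straightforward to verify for a concrete weight matrix; a checker might look like this (our sketch):

```python
import numpy as np

def theorem4_holds(W, P):
    # leak(j) > 2 * sum of |W_ji| over negative outgoing weights, for j > P
    N = W.shape[0] - 1
    for j in range(P + 1, N + 1):
        leak = W[j, :P+1].sum()          # leak(j) = sum_{i<=P} W_ji
        out = W[j, P+1:]
        neg = -out[out < 0].sum()        # sum of |W_ji| with W_ji < 0
        if not leak > 2 * neg:
            return False
    return True
```

When the check passes (and $|\rho'| \leq 1$), the fixed-point iteration of $\Phi$ is a contraction on $X$, so repeatedly applying the discretized update converges to the unique equilibrium.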

4.2. On the Stability During the Learning Phase (β > 0)

As previously stated, both Corollary 1 and Theorem 4 also ensure convergence of the learning phase, as can be seen in this Section.

4.2.1. DEEP Model

In Section 3.2, it was mentioned that, in [11], Theorem 1 was proved using the Gershgorin Circle Theorem to guarantee that all eigenvalues of matrix $J$ have strictly negative real parts. This was done by imposing the restriction that each circle be centered on the left half of the complex plane, with a radius smaller than the distance from its center to the origin, i.e.,

$$J_{jj} < 0 \quad \text{and} \quad R_j < |J_{jj}|, \quad j \in \{1, \dots, N\}, \qquad (33)$$

where $R_j = \sum_{i=1, i \neq j}^{N} |J_{ji}|$ is the $j$th Gershgorin circle's radius.

Let $K$ be the number of output neurons. Noting that $J$ is a square matrix of size $N - P$ and that the cost function used is the mean squared error, when $\beta > 0$ the only change to $J$ is that $\beta$ is subtracted from the diagonal entries corresponding to output neurons (i.e., $J_{ii}$ with $N - P - K < i \leq N - P$).

Assuming Corollary 1's conditions were satisfied during an inference phase of the DEEP model, we simply have to check that they still hold for $(N - K) < j \leq N$:
  • $\sum_{i=0}^{N} W_{ji} > 0 \implies \sum_{i=0}^{N} W_{ji} + \beta > 0$;
  • given that $\beta > 0$ and the previous condition holds, $\left| \sum_{i=0}^{N} W_{ji} + \beta \right| > \left| \sum_{i=0}^{N} W_{ji} \right| > \sum_{i=P+1}^{N} |W_{ij}|$.

Therefore, if Corollary 1's stability conditions held during DEEP's inference phase, they still do during its subsequent learning phase.

4.2.2. Theorem 4

The same is true of the new stability conditions defined in Theorem 4. Going back to Theorem 4's proof, and again considering that the only change in matrix $J$ is that $\beta$ is subtracted from the diagonal entries corresponding to output neurons, we now have

$$\left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} - \beta \right| + \sum_{i=P+1}^{N} \left| dt \cdot W_{P+j,i} \right| < 1. \qquad (34)$$

However, since $0 < \beta \leq 1$, it can be argued that

$$\left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} - \beta \right| \leq \left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} \right|. \qquad (35)$$

This is because, if $0 < \beta < 1$, for sufficiently small $dt$, (35) becomes

$$1 - dt \sum_{i=0}^{N} W_{P+j,i} - \beta \leq 1 - dt \sum_{i=0}^{N} W_{P+j,i} \iff -\beta \leq 0. \qquad (36)$$

If $\beta = 1$, again for sufficiently small $dt$, (35) becomes

$$\left| dt \sum_{i=0}^{N} W_{P+j,i} \right| \leq 1 - dt \sum_{i=0}^{N} W_{P+j,i}, \qquad (37)$$

which holds trivially given that $dt$ is, again, sufficiently small. Either way, (35) holds and

$$\left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} - \beta \right| + \sum_{i=P+1}^{N} \left| dt \cdot W_{P+j,i} \right| \leq \left| 1 - dt \sum_{i=0}^{N} W_{P+j,i} \right| + \sum_{i=P+1}^{N} \left| dt \cdot W_{P+j,i} \right| < 1, \qquad (38)$$
where the last inequality comes from our assumption that the network satisfied Theorem 4’s conditions for stability.

4.3. Remark on the Link Between New and Previous Stability Conditions

Theorem 2 also holds if the matrix norm $\|\cdot\|_\infty$ is used instead (see [13,14]). Then, the same reasoning used in proving Theorem 4 leads one to recover exactly the previous stability conditions (given by Corollary 1). Similarly, one can adapt Theorem 1's proof. Recall that the Gershgorin Circle Theorem was used to ensure that all the eigenvalues of matrix (19) have negative real parts. If we apply the same reasoning to the transpose of matrix (19), which has exactly the same eigenvalues, we obtain instead the new stability conditions. Two very different approaches to the same problem seem to yield the same results.
In this section, we presented the new DEEP model with leakage and proved stability conditions for both the learning and inference phases. In the next section, we will additionally introduce a learning rule inspired by [15] and analyze its relationship with the gradient of the objective function.

5. Generalization of Equilibrium Propagation to Vector Field Dynamics

In [15], in an attempt to generalize EP to vector field dynamics, the authors used a learning rule which, in the limit $\beta \to 0$, relates to the gradient $\nabla_W J(x, y, W)$ in a manner made explicit by Theorem 5 (see Section 5.1 below).
In this section, the main steps followed in [15] to build up to the new learning rule are presented, as well as the special case of EP. Then, we apply the same reasoning to DEEP, producing a learning rule whose effectiveness we explore for a small network of N = 4 neurons.

5.1. On the Learning Rule

In [15], the authors consider the simplified Hebbian update rule based on pre- and post-synaptic activity, given by

$$dW_{ij} \propto \rho(s_i)\, ds_j. \qquad (39)$$

Since, for their vector field $\mu$, it holds that $\rho(s_i) = \frac{\partial \mu_j}{\partial W_{ij}}(s)$ and, for all $j' \neq j$, $\frac{\partial \mu_{j'}}{\partial W_{ij}} = 0$, Equation (39) can be written in the concise form (it is important to note that, in [15] and in the DEEP model, $\mu$ and $V$ are vector fields from $\mathbb{R}^{L+K}$ to $\mathbb{R}^{L+K}$, acting on the hidden and output neurons; for that reason, in this section, the state vector $s$ only contains the activities of those $L + K$ neurons)

$$dW \propto \frac{\partial \mu}{\partial W}(W, s)^T \cdot ds. \qquad (40)$$

The weight update is assumed to occur during the second phase of training (the learning phase), when the network's state transitions from the first fixed point $s^0$ to the second fixed point $s^\beta$. By integrating (40) from $s^0$ to $s^\beta$, normalizing by a factor $\beta$, and then taking the limit $\beta \to 0$, we arrive at the update rule $\Delta W \propto \nu(W)$, where $\nu(W)$ is the vector defined as

$$\nu(W) = \frac{\partial \mu}{\partial W}(W, s^0)^T \cdot \frac{\partial s^\beta}{\partial \beta}(0). \qquad (41)$$
Formally, ν ( W ) is a vector field in the weight space. The following result is proved in [15]:
Theorem 5. The gradient $\nabla_W J(x, y, W)$ and the vector field $\nu(W)$ can be expressed explicitly in terms of $\mu$ and $C$:

$$\nabla_W J(x, y, W) = -\frac{\partial C}{\partial s}(s^0) \cdot \left( \frac{\partial \mu}{\partial s}(s^0) \right)^{-1} \cdot \frac{\partial \mu}{\partial W}(s^0), \qquad (42)$$

$$\nu(W) = \frac{\partial C}{\partial s}(s^0) \cdot \left( \left( \frac{\partial \mu}{\partial s}(s^0) \right)^{-1} \right)^{T} \cdot \frac{\partial \mu}{\partial W}(s^0), \qquad (43)$$

where $s^0$ is the equilibrium state reached in the inference phase.

This result shows that $\nu(W)$ is related to $\nabla_W J(W)$ and that the angle between these two vectors is directly related to the "degree of symmetry" of the Jacobian of $\mu$ at the fixed point $s^0$.

Particular Case of the Original EP

In the original EP [5], we have

$$\nu(W_{ij}) = \frac{\partial \mu}{\partial W_{ij}}(s^0)^T \cdot \frac{\partial s^\beta}{\partial \beta}(0) = \frac{\partial s_i^\beta}{\partial \beta}(0)\, s_j^0 + s_i^0\, \frac{\partial s_j^\beta}{\partial \beta}(0). \qquad (44)$$

Since $W_{ij} = W_{ji}$ for all $i, j$, the matrix $\frac{\partial \mu}{\partial s}(s)$ is symmetric and, by Theorem 5, $\nu(W)$ should coincide with the update rule used in [5], given by

$$\Delta W_{ij} \propto \lim_{\beta \to 0} \frac{s_i^\beta s_j^\beta - s_i^0 s_j^0}{\beta}. \qquad (45)$$

Under mild regularity conditions on $\mu$ and $C$ (recall that $C$ is the cost function), the implicit function theorem ensures that, for fixed input and output $(x, y)$, the function $\beta \mapsto s^\beta$ is differentiable. So, we use Cauchy's rule (l'Hôpital's rule) to evaluate (45):

$$\lim_{\beta \to 0} \frac{s_i^\beta s_j^\beta - s_i^0 s_j^0}{\beta} = \lim_{\beta \to 0} \frac{\frac{\partial s_i^\beta}{\partial \beta}\, s_j^\beta + s_i^\beta\, \frac{\partial s_j^\beta}{\partial \beta}}{1} \qquad (46)$$

$$= \frac{\partial s_i^\beta}{\partial \beta}(0)\, s_j^0 + s_i^0\, \frac{\partial s_j^\beta}{\partial \beta}(0) = \nu(W_{ij}). \qquad (47)$$

We have thus shown that, despite appearing to be different, the update rule $\nu(W)$ and the one used in [5] are the same.

5.2. ν ( W ) in the DEEP Model

When recreating the steps followed in [15] to reach $\nu(W)$ in the DEEP model, replacing the vector field $\mu$ with $V$, instead of Equation (39) one should start from

$$dW_{ij} \propto \rho(s_i)\, ds_j - \rho(s_i)\, ds_i = \rho(s_i)\,(ds_j - ds_i). \qquad (48)$$

Note that, due to the different version of gradient descent considered (see Equation (8)), $s_i = \rho(s_i)$ and, similarly to Equation (40), it follows that

$$dW \propto \frac{\partial V}{\partial W}(W, s)^T \cdot ds. \qquad (49)$$

Then, the same reasoning produces the update rule

$$\nu(W_{ij}) = s_i^0 \cdot \frac{\partial s_j^\beta}{\partial \beta}(0) - s_i^0 \cdot \frac{\partial s_i^\beta}{\partial \beta}(0) = \lim_{\beta \to 0} \frac{s_i^0\big(s_j^\beta - s_j^0\big) - s_i^0\big(s_i^\beta - s_i^0\big)}{\beta}, \qquad (50)$$

which is related to $\nabla_W J(x, y, W)$ precisely as stated in Theorem 5.
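In a simulation, $\nu(W_{ij})$ can be approximated with a small finite $\beta$ by comparing the two equilibrium states, following the limit above (our sketch; s0 and sb denote the equilibria of the two phases):

```python
import numpy as np

def nu_estimate(s0, sb, beta):
    # nu(W_ij) ~ [ s_i^0 (s_j^beta - s_j^0) - s_i^0 (s_i^beta - s_i^0) ] / beta
    d = (sb - s0) / beta
    return np.outer(s0, d) - np.outer(s0 * d, np.ones_like(s0))
```

The entry [i, j] of the returned matrix is the estimated update direction for $W_{ij}$.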

Convergence of Training a Small Network

While, for symmetric weights, ν ( W ) equals the negative of the gradient of the objective function, this is no longer true when we allow the network to have asymmetric weights.
In [15], the authors claim that experiments run on the MNIST dataset show that the objective function $J$ consistently decreases, but they give no conditions under which the alignment between $\nu(W)$ and $-\nabla_W J(x, y, W)$ occurs. In an attempt to study how aligned these two vectors are in practice, we consider a network with 2 input neurons, 1 intermediate neuron, and 1 output neuron, as well as the bias neuron (see Figure 3). Then, using Theorem 5, we explicitly calculate $\nu(W)$ and $\nabla_W J(s^0)$ and compare them.

In Figure 3, colored in red, we can see the connections added to the DEEP model in this work. The vector field $V = [V_3, V_4]$ and the cost function $C$ are given by
$$V_3(s) = \sum_{i=0}^{4} W_{i3}\, s_i - s_3 \sum_{i=0}^{4} W_{3i}, \qquad (51)$$

$$V_4(s) = \sum_{i=0}^{4} W_{i4}\, s_i - s_4 \sum_{i=0}^{4} W_{4i}, \qquad (52)$$

$$C(s, y) = \frac{1}{2}(s_4 - y)^2. \qquad (53)$$

To simplify, we assume there is only leakage in neuron 4 and $W_{34}, W_{43} \geq 0$. Then,

$$\frac{\partial C}{\partial s}(s^0) = \begin{pmatrix} 0 & s_4^0 - y \end{pmatrix}, \qquad (54)$$

$$\frac{\partial V}{\partial s}(s^0) = \begin{pmatrix} -\sum_{i=0}^{4} W_{3i} & W_{43} \\ W_{34} & -\sum_{i=0}^{4} W_{4i} \end{pmatrix}, \qquad (55)$$

and

$$\left( \frac{\partial V}{\partial s}(s^0) \right)^{-1} = \frac{1}{\left( \sum_{i=0}^{4} W_{3i} \right)\left( \sum_{i=0}^{4} W_{4i} \right) - W_{34} W_{43}} \begin{pmatrix} -\sum_{i=0}^{4} W_{4i} & -W_{43} \\ -W_{34} & -\sum_{i=0}^{4} W_{3i} \end{pmatrix}. \qquad (56)$$

Consider the network's weights in the order $W_{03}, W_{04}, W_{13}, W_{14}, W_{23}, W_{24}, W_{34}, W_{41}, W_{42}, W_{43}$. Then, $\frac{\partial V}{\partial W}(s^0)$ is the following $2 \times 10$ matrix:

$$\begin{pmatrix} s_0 & 0 & s_1 & 0 & s_2 & 0 & -s_3 & 0 & 0 & s_4 \\ 0 & s_0 & 0 & s_1 & 0 & s_2 & s_3 & -s_4 & -s_4 & -s_4 \end{pmatrix}. \qquad (57)$$
Using Theorem 5, we can explicitly compute $\nu(W)$, $\nabla_W J(W)$, and their inner product:

$$\nu(W) = \frac{s_4^0 - y}{\det\left(\frac{\partial V}{\partial s}\right)} \left( -W_{43}\, s_0,\ -\sum_{i=1}^{4} W_{3i}\, s_0,\ -W_{43}\, s_1,\ -\sum_{i=1}^{4} W_{3i}\, s_1,\ -W_{43}\, s_2,\ -\sum_{i=1}^{4} W_{3i}\, s_2,\ \Big(W_{43} - \sum_{i=1}^{4} W_{3i}\Big) s_3,\ \sum_{i=1}^{4} W_{3i}\, s_4,\ \sum_{i=1}^{4} W_{3i}\, s_4,\ \Big(\sum_{i=1}^{4} W_{3i} - W_{43}\Big) s_4 \right) \qquad (58)$$

$$\nabla_W J(W) = \frac{s_4^0 - y}{\det\left(\frac{\partial V}{\partial s}\right)} \left( W_{34}\, s_0,\ \sum_{i=1}^{4} W_{3i}\, s_0,\ W_{34}\, s_1,\ \sum_{i=1}^{4} W_{3i}\, s_1,\ W_{34}\, s_2,\ \sum_{i=1}^{4} W_{3i}\, s_2,\ \Big(\sum_{i=1}^{4} W_{3i} - W_{34}\Big) s_3,\ -\sum_{i=1}^{4} W_{3i}\, s_4,\ -\sum_{i=1}^{4} W_{3i}\, s_4,\ \Big(W_{34} - \sum_{i=1}^{4} W_{3i}\Big) s_4 \right) \qquad (59)$$

$$\left\langle \nu(W), -\nabla_W J(W) \right\rangle = \frac{(s_4^0 - y)^2}{\det\left(\frac{\partial V}{\partial s}\right)^2} \Bigg( W_{34} W_{43}\, s_0^2 + \Big(\sum_{i=1}^{4} W_{3i}\Big)^2 s_0^2 + W_{34} W_{43}\, s_1^2 + \Big(\sum_{i=1}^{4} W_{3i}\Big)^2 s_1^2 + W_{34} W_{43}\, s_2^2 + \Big(\sum_{i=1}^{4} W_{3i}\Big)^2 s_2^2 + \Big(W_{34} - \sum_{i=1}^{4} W_{3i}\Big)\Big(W_{43} - \sum_{i=1}^{4} W_{3i}\Big) s_3^2 + \Big(\sum_{i=1}^{4} W_{3i}\Big)^2 s_4^2 + \Big(\sum_{i=1}^{4} W_{3i}\Big)^2 s_4^2 + \Big(W_{34} - \sum_{i=1}^{4} W_{3i}\Big)\Big(W_{43} - \sum_{i=1}^{4} W_{3i}\Big) s_4^2 \Bigg) \qquad (60)$$

Neuron 3 has no leakage, so $W_{34} = \sum_{i=1}^{4} W_{3i}$, and this reduces to

$$\left\langle \nu(W), -\nabla_W J(W) \right\rangle = \frac{(s_4^0 - y)^2}{\det\left(\frac{\partial V}{\partial s}\right)^2} \Big( W_{34} W_{43} \left(s_0^2 + s_1^2 + s_2^2\right) + W_{34}^2 \left(s_0^2 + s_1^2 + s_2^2 + 2 s_4^2\right) \Big) > 0. \qquad (61)$$
Since $0 \leq s_i^2 \leq 1$ and $s_0 = 1$ (the bias neuron has activity fixed to 1), we have

$$\left\langle \nu(W), -\nabla_W J(W) \right\rangle \geq \frac{(s_4^0 - y)^2}{\det\left(\frac{\partial V}{\partial s}\right)^2} \Big( W_{34} W_{43}\,(1 + 0 + 0) + W_{34}^2\,(1 + 0 + 0 + 0) \Big) = \frac{(s_4^0 - y)^2}{\det\left(\frac{\partial V}{\partial s}\right)^2}\, W_{34} \left(W_{34} + W_{43}\right). \qquad (62)$$

Analogously,

$$\|\nu(W)\| = \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \Big( \left(s_0^2 + s_1^2 + s_2^2\right)\left(W_{34}^2 + W_{43}^2\right) + s_3^2 \left(W_{34} - W_{43}\right)^2 + s_4^2 \left(2 W_{34}^2 + \left(W_{34} - W_{43}\right)^2\right) \Big)^{\frac{1}{2}} \leq \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \Big( (1 + 1 + 1)\left(W_{34}^2 + W_{43}^2\right) + \left(W_{34} - W_{43}\right)^2 + 2 W_{34}^2 + \left(W_{34} - W_{43}\right)^2 \Big)^{\frac{1}{2}} = \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \left( 5 W_{34}^2 + 3 W_{43}^2 + 2\left(W_{34} - W_{43}\right)^2 \right)^{\frac{1}{2}} \qquad (63)$$

and

$$\|\nabla_W J(W)\| = \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \left( \left(s_0^2 + s_1^2 + s_2^2 + s_4^2\right)\left(2 W_{34}^2\right) \right)^{\frac{1}{2}} \leq \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \left( (1 + 1 + 1 + 1)\left(2 W_{34}^2\right) \right)^{\frac{1}{2}} = \frac{\left|s_4^0 - y\right|}{\left|\det\left(\frac{\partial V}{\partial s}\right)\right|} \sqrt{8}\, W_{34}. \qquad (64)$$

These computations allow us to derive a lower bound on the cosine of the angle $\theta$ between $\nu(W)$ and $-\nabla_W J(W)$, which ideally would always be close to 1 (the common factor $(s_4^0 - y)^2 / \det(\partial V / \partial s)^2$ cancels):

$$\cos\theta = \frac{\left\langle \nu(W), -\nabla_W J(W) \right\rangle}{\|\nu(W)\| \cdot \|\nabla_W J(W)\|} \geq \frac{W_{34}\left(W_{34} + W_{43}\right)}{\sqrt{8}\, W_{34} \sqrt{5 W_{34}^2 + 3 W_{43}^2 + 2\left(W_{34} - W_{43}\right)^2}} = \frac{W_{34} + W_{43}}{\sqrt{8} \sqrt{5 W_{34}^2 + 3 W_{43}^2 + 2\left(W_{34} - W_{43}\right)^2}}. \qquad (65)$$
Consider the function $f$ defined as

$$f: \left(\mathbb{R}_0^+\right)^2 \to \mathbb{R}, \quad (a, b) \mapsto \frac{a + b}{\sqrt{5 a^2 + 3 b^2 + 2 (a - b)^2}}. \qquad (66)$$

Writing $b = a \cdot k$, $k \in \mathbb{R}_0^+$, yields

$$f(a, b) = \frac{a\,(1 + k)}{\sqrt{a^2 \left(5 + 3 k^2 + 2 (1 - k)^2\right)}} \iff f(a, b) = \frac{1 + k}{\sqrt{5 + 3 k^2 + 2 (1 - k)^2}}. \qquad (67)$$

It is easy to see that this function has a minimum of $\frac{1}{\sqrt{7}}$ when $k = 0 \iff b = 0$ and a maximum of around 0.718 when $k \approx 1.286$. Furthermore, when $a = 0$, $f(0, b) = \frac{1}{\sqrt{5}}$, and $0.718 > \frac{1}{\sqrt{5}} > \frac{1}{\sqrt{7}}$.

We conclude that

$$\cos\theta > \frac{1}{\sqrt{8}\sqrt{7}} \implies \theta < 82.33^\circ. \qquad (68)$$

In particular, if $W_{43} = 1.286\, W_{34}$, it holds that

$$\theta < 75.29^\circ. \qquad (69)$$
This guarantees some alignment between the new learning rule $\nu(W)$, presented for the DEEP model in this example network, and the gradient of the objective function. Further research is necessary to determine conditions for the alignment of $\nu(W)$ and the gradient of the objective function in more general networks.
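The extrema of $f$ and the resulting angle bounds are easy to confirm numerically (our verification sketch):

```python
import numpy as np

k = np.linspace(0.0, 50.0, 1_000_001)
g = (1 + k) / np.sqrt(5 + 3 * k**2 + 2 * (1 - k)**2)

print(g.min(), 1 / np.sqrt(7))        # minimum at k = 0: 1/sqrt(7) ~ 0.378
print(g.max(), k[g.argmax()])         # maximum ~0.718 at k = 9/7 ~ 1.286
print(np.degrees(np.arccos(g.min() / np.sqrt(8))))   # ~82.3 degrees
print(np.degrees(np.arccos(g.max() / np.sqrt(8))))   # ~75.3 degrees
```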

6. Conclusions

In this paper, we employed dynamical systems theory to rigorously analyze learning dynamics in models of biologically plausible neural networks, encompassing both feed-forward and recurrent architectures. By critically examining the DirEcted Equilibrium Propagation (DEEP) model introduced in [11], we identified inherent stability limitations arising from the absence of an energy function. To address these, we proposed an extension of the DEEP model with a neuronal leakage term and established, via theoretical analysis, that this modification ensures convergence in both the inference and learning phases.
Additionally, we clarified the relationship among various local learning rules, demonstrating that the update mechanisms in equilibrium propagation [5] align with those in [15], reinforcing the broader applicability of these learning principles.
Our investigation into learning with asymmetric feedback weights revealed that, although such systems depart from traditional energy-based frameworks, learning can still proceed under certain conditions. Specifically, for a small four-neuron network, we derived an explicit bound on the angle between the update vector and the true gradient, showing that effective learning is possible in this restricted case. However, generalizing these convergence guarantees to larger or more complex architectures remains an open challenge.
The main limitation of our study lies in its focus on small network sizes and specific architectural assumptions. While our theoretical results offer valuable insight, practical implementation and scalability in larger, real-world networks have yet to be established. Further work is needed to derive generalized angle bounds and convergence criteria for networks of arbitrary size and complexity, as well as to explore the empirical performance of these models on benchmark tasks.
We hope that our findings shed new light on the possibilities of biologically plausible learning in neural networks and encourage further research—both theoretical and experimental—into designing robust, scalable, and biologically inspired learning algorithms.

Author Contributions

Conceptualization, P.A.S.; Investigation, P.C.; Writing—original draft, P.C.; Writing—review & editing, P.A.S.; Supervision, P.A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UIDB/50021/2020, and by a Calouste Gulbenkian Foundation grant under the Novos Talentos em Matemática programme.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the analyses, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ANN   Artificial Neural Network
DEEP  DirEcted Equilibrium Propagation
EP    Equilibrium Propagation

References

  1. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
  2. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv 2017, arXiv:1712.01815.
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
  4. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  5. Scellier, B.; Bengio, Y. Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Front. Comput. Neurosci. 2017, 11, 24.
  6. Laborieux, A.; Ernoult, M.; Scellier, B.; Bengio, Y.; Grollier, J.; Querlioz, D. Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing Its Gradient Estimator Bias. Front. Neurosci. 2021, 15, 633674.
  7. Martin, E.; Ernoult, M.; Laydevant, J.; Li, S.; Querlioz, D.; Petrisor, T.; Grollier, J. EqSpike: Spike-driven equilibrium propagation for neuromorphic implementations. iScience 2021, 24, 102222.
  8. Foroushani, A.N.; Assaf, H.; Noshahr, F.H.; Savaria, Y.; Sawan, M. Analog Circuits to Accelerate the Relaxation Process in the Equilibrium Propagation Algorithm. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
  9. Kiraz, F.Z.; Pham, D.K.G.; Desgreys, P. Impacts of Feedback Current Value and Learning Rate on Equilibrium Propagation Performance. In Proceedings of the 2022 20th IEEE Interregional NEWCAS Conference (NEWCAS), Quebec City, QC, Canada, 19–22 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 519–523.
  10. Laborieux, A.; Zenke, F. Holomorphic Equilibrium Propagation Computes Exact Gradients Through Finite Size Oscillations. Adv. Neural Inf. Process. Syst. 2022.
  11. Farinha, M.T.; Pequito, S.; Santos, P.A.; Figueiredo, M.A.T. Equilibrium Propagation for Complete Directed Neural Networks. In Proceedings of the ESANN 2020: 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 2–4 October 2020.
  12. Gerschgorin, S. Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk 1931, 6, 749–754. (In German)
  13. Ortega, J.; Rheinboldt, W. Iterative Solution of Nonlinear Equations in Several Variables; Elsevier: Amsterdam, The Netherlands, 1970; p. 358.
  14. Atkinson, K.E. An Introduction to Numerical Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 1991; p. 720.
  15. Scellier, B.; Goyal, A.; Binas, J.; Mesnard, T.; Bengio, Y. Generalization of Equilibrium Propagation to Vector Field Dynamics. arXiv 2018, arXiv:1808.04873.
Figure 1. Example network of the EP model.
Figure 2. Example network of the DEEP model.
Figure 3. Small network considered, where $s_0 \equiv 1$ denotes the bias neuron.