Article

Vector Flows That Compute the Capacity of Discrete Memoryless Channels †

by
Guglielmo Beretta
1,2,* and
Marcello Pelillo
1,3,4,*
1
DAIS, Università Ca’ Foscari di Venezia, Via Torino 155, 30170 Venezia, Italy
2
DAUIN, Politecnico di Torino, Corso Castelfidardo 34/d, 10138 Torino, Italy
3
College of Mathematical Medicine, Zhejiang Normal University, Jinhua 321004, China
4
European Centre for Living Technology, Ca’ Bottacin, Dorsoduro 3911, Calle Crosera, 30123 Venezia, Italy
*
Authors to whom correspondence should be addressed.
This paper is an extended version of our paper published in the 7th International Conference, DIS 2024, Kalamata, Greece, 2–7 June 2024.
Entropy 2025, 27(4), 362; https://doi.org/10.3390/e27040362
Submission received: 8 February 2025 / Revised: 25 March 2025 / Accepted: 25 March 2025 / Published: 29 March 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

One of the fundamental problems of information theory, since its foundation by C. Shannon, has been the computation of the capacity of a discrete memoryless channel, a quantity expressing the maximum rate at which information can travel through the channel. In this paper, we investigate the properties of a novel approach to computing the capacity, based on a continuous-time dynamical system. Interestingly, the proposed dynamical system can be regarded as a continuous-time version of the classical Blahut–Arimoto algorithm, and we can prove that the former shares with the latter an exponential rate of convergence if certain conditions are met. Moreover, a circuit design is presented to implement the dynamics, hence enabling analog computation to estimate the capacity.

1. Introduction

Estimating the capacity of a discrete memoryless channel (DMC) is a well-known problem related to the reliability of point-to-point communication systems, as a consequence of C. Shannon’s noisy-channel coding theorem [1,2,3]. A fundamental algorithm to address this problem is the classical Blahut–Arimoto algorithm (BAA), an iterative algorithm based on an alternating maximization procedure [4]. Named after S. Arimoto [5] and R. Blahut [6], who discovered it independently, the BAA requires only some mild conditions on the zero elements of the transition matrix and, unlike its antecedents, it is also applicable when the input alphabet and the output alphabet of the channel have different cardinalities. Notably, the BAA is still a subject of active research (see, e.g., Refs. [7,8,9,10]).
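As a concrete illustration of the alternating-maximization scheme just described, the following sketch implements one plausible form of the BAA iteration in NumPy. The function name `baa_capacity`, the uniform initialization, and the stopping tolerance are our own illustrative choices, not prescribed by the references.

```python
import numpy as np

def baa_capacity(P, tol=1e-12, max_iter=10_000):
    """Estimate the capacity (in nats) of a DMC with transition matrix P
    (rows indexed by input symbols, columns by output symbols) via
    Blahut-Arimoto-style multiplicative updates."""
    n, _ = P.shape
    z = np.full(n, 1.0 / n)                # start from the uniform input distribution
    for _ in range(max_iter):
        q = z @ P                          # output distribution: q_j = sum_i p(j|i) z_i
        # weights w_i = sum_j p(j|i) ln[p(j|i)/q_j], restricted to p(j|i) > 0
        w = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0) / q), 0.0).sum(axis=1)
        z_new = z * np.exp(w)
        z_new /= z_new.sum()               # renormalize back onto the simplex
        if np.max(np.abs(z_new - z)) < tol:
            z = z_new
            break
        z = z_new
    q = z @ P
    w = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0) / q), 0.0).sum(axis=1)
    return float(z @ w), z                 # mutual information at z, and z itself
```

For a binary symmetric channel with crossover probability $p$, the estimate can be checked against the closed form $C = \ln 2 + p \ln p + (1-p)\ln(1-p)$ nats.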
In contrast to the BAA, this paper describes a continuous-time dynamical system used to compute the capacity of a DMC. This dynamical system is obtained via the (forward) flow of a suitable ODE, and it can evolve a given distribution towards an optimal input distribution of the channel, hence enabling capacity computation. In studying this unconventional way to address capacity computation, we were inspired by the work by R. W. Brockett, who, as reported, e.g., in [11], provides a novel way, grounded in calculus, to approach problems traditionally addressed via algorithms [12]. Interestingly, the proposed capacity-computing ODE (denoted by CC-ODE in the sequel) has a connection with the BAA. Indeed, it can be regarded, in a sense, as a continuous-time version of the BAA. To support this claim, we leverage the notion of the multiplicative-weight-update (MWU) rule [13] and its connection to both the BAA (see Refs. [8,14]) and to some discretization techniques used, e.g., in evolutionary game theory [15,16].
The link between the BAA and the CC-ODE flow can be further extended when studying the convergence rates. By construction, the BAA generates a sequence of input distributions, and understanding the convergence rate of this sequence to an optimal input distribution has been central in the study of the BAA ever since its origin [5]. Remarkably, estimating an optimal input distribution for a generic DMC involves some issues affecting not only the BAA, but also any iterative algorithm running on a Turing machine, as recently shown in [10] via computability theory arguments. Despite this, some technical conditions on DMCs ensure exponential convergence of this sequence to an optimal input distribution [5,9]. We prove that under these conditions, in the formulation given in [9], the CC-ODE flow converges exponentially to an optimal input distribution, which can be shown thanks to some tools of Lyapunov’s stability theory [17]. The convergence rate can be further refined for a trivial family of DMCs, namely, the noiseless symmetric channels—see, e.g., Ref. [18] (p. 77). Even though a known formula exists for their capacity, we found it interesting that these channels are associated with a CC-ODE for which an explicit analytic solution is available, and we leverage this to produce a more precise asymptotic estimate of the flow.
Lastly, we propose a circuit design to implement the CC-ODE flow, thereby enabling analog computation of the capacity. Analog computation is an important alternative to digital computation [19], and is still a topic of active research—see, e.g., Refs. [20,21,22,23,24,25,26,27]. We speculate about how the proposed circuit could preserve its effectiveness even in the presence of noise, and we comment on its usage in association with the unavailability of some channel input symbols. See also Ref. [28] for a similar circuit design dealing with a labeling task.
This paper is a follow-up of [29], where the empirical usage of numerical methods applied to the CC-ODE was studied to compute the capacity. However, no mathematical proofs were given in [29], and except for a simplified version of Theorem 2, the results presented in this paper do not appear in [29]. To the best of our knowledge, no previous work has studied continuous-time dynamical systems to compute the capacity of DMCs.
Despite this, the CC-ODE can be regarded, under some technical conditions, as an instance of a class of ODEs discussed in [30], where smooth functions are optimized over polyhedra using some notions of Riemannian geometry. In addition to that, similar ODEs appear in the literature on convex optimization (see, e.g., Ref. [31]), as well as in some models pertaining to evolutionary game theory [16,32]. However, in contrast to [16,30,31], the objective function associated with the CC-ODE may not be differentiable on part of the boundary of the feasible set, and we provide the technical arguments that are required to adapt the existing results to the problem under discussion.
The subsequent sections of this paper are organized as follows. Notation conventions are established in Section 2, and Section 3 reviews the aforementioned class of ODEs that have been used in the literature to tackle optimization programs on a standard simplex. We also mention how relaxed hypotheses on the objective function may negatively affect trajectory convergence to stationary points, and in Lemma 1, we discuss some alternative conditions to overcome this issue. In Section 4, the reader is introduced to the problem of computing the channel capacity for DMCs and its formulation as a concave optimization program on a standard simplex. The main contributions of this paper are in Section 5, which deals with the CC-ODE, its properties, and its link with the BAA, and Section 6, where the convergence rates are discussed. The circuit designed to implement the CC-ODE flow is examined in Section 7. Some simulations are described in Section 8 so as to give more insight into how the CC-ODE flow behaves in comparison with the BAA. Section 9 contains some remarks on our results, and final considerations and future research directions are reported in Section 10.

2. Notations

In this paper, the information content is measured in nats, since this choice simplifies the theoretical computations involving mutual information, as mentioned, e.g., in [6]. For a conversion to bits, we recall that $1$ nat equals $1/\ln 2$ bits [2]. In the sequel, we consider the expression $\alpha \ln \alpha$ as well defined and equal to $0$ in case $\alpha = 0$. For every integer $n > 0$, the standard simplex in $\mathbb{R}^n$ is the set
$$\Delta_n = \left\{ z \in \mathbb{R}^n \;\middle|\; z \ge 0 \ \text{and} \ \sum_{i=1}^n z_i = 1 \right\},$$
its (relative) interior is the set
$$\operatorname{int}(\Delta_n) = \left\{ z \in \mathbb{R}^n \;\middle|\; z > 0 \ \text{and} \ \sum_{i=1}^n z_i = 1 \right\},$$
and its (relative) boundary is the set $\partial\Delta_n = \Delta_n \setminus \operatorname{int}(\Delta_n)$. For $z = (z_1, \dots, z_n) \in \mathbb{R}^n$, the support of $z$ is the set $\operatorname{supp}(z) = \{ i \in [n] \mid z_i \neq 0 \}$. The canonical basis of $\mathbb{R}^n$ is denoted by $e_1$, …, $e_n$, and $\mathbf{1}_n = \sum_{i=1}^n e_i$. Given $x, y \in \mathbb{R}^n$, we write $\langle x, y\rangle = \sum_{i=1}^n x_i y_i$ for the usual dot product between $x$ and $y$, whereas we write $\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{\sum_{i=1}^n x_i^2}$ for the Euclidean norm of $x$, so that $\|x - y\|$ denotes the Euclidean distance between $x$ and $y$. In this paper, gradients are column vectors. Given $x_1$, $x_2$, …, and $x_n \in \mathbb{R}$, we define
$$\operatorname{MSQ}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n \left( x_i - \frac{1}{n} \sum_{k=1}^n x_k \right)^2.$$

3. Optimizing Continuous-Time Dynamics

3.1. Preliminaries

We recall that a function $v : \Omega \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is (globally) Lipschitz continuous on $\Omega$ if there exists a constant $L > 0$ such that $\|v(x_1) - v(x_2)\| \le L \|x_1 - x_2\|$ for every $x_1, x_2 \in \Omega$, whereas $v$ is locally Lipschitz continuous on $\Omega$ if for every $x \in \Omega$, there exists an open neighborhood $O_x \subseteq \mathbb{R}^n$ of $x$ such that the restriction of $v$ to $\Omega \cap O_x$ is Lipschitz continuous on $\Omega \cap O_x$. Global and local Lipschitz continuity play a fundamental role in the existence and uniqueness results concerning solutions of an ODE, as in the Picard–Lindelöf theorem—see, e.g., Refs. [33,34]. Given an open set $\Omega \subseteq \mathbb{R}^n$ and a function $v : \Omega \to \mathbb{R}^n$ that is locally Lipschitz continuous on $\Omega$, consider the ODE
$$\dot z = v(z). \qquad (1)$$
We will say that a non-empty set $S \subseteq \Omega$ is invariant under (1) if for every $y \in S$, there exists $z : (-\infty, +\infty) \to S$ that solves (1) and satisfies $z(0) = y$. Similarly, we will say that $S$ is forward invariant under (1) if for every $y \in S$, there exists $z : [0, +\infty) \to S$ that solves (1) and satisfies $z(0) = y$. By the assumptions made on $v$, note that for every $y \in S$ there exists at most one solution $z : [0, +\infty) \to \Omega$ of (1) such that $z(0) = y$. This is a trivial consequence of the Picard–Lindelöf theorem. However, note that these assumptions do not guarantee the existence of such a solution—which is related to the problem of extending local solutions (see also Ref. [35] (Ch. 17.4))—and we remark that stronger assumptions are generally required to prove the existence, such as those reported in Appendix A. We also recall that a point $y \in \Omega$ is stationary for (1) if $v(y) = 0$, i.e., if the constant function $z(t) \equiv y$ solves (1).

3.2. Optimizing Differential Equation on the Standard Simplex

A program of the form
$$\max_{z \in \Delta_n} \; f(z) \qquad (2)$$
is related, under suitable conditions on its objective function f, to the ODE
$$\dot z_i = z_i \left[ \partial_i f(z) - \sum_{k=1}^n z_k \, \partial_k f(z) \right], \qquad i \in [n], \qquad (3)$$
which can also be written in vector notation as
$$\dot z = \left[ \operatorname{diag}(z) - z z^\top \right] \nabla f(z).$$
In particular, for z int ( Δ n ) , the ODE (3) is known as the Shahshahani gradient system associated with f—see, e.g., Refs. [16,32]. Many properties of (3) are discussed in the literature under the assumption that f admits a globally Lipschitz continuous gradient on Δ n , which guarantees the invariance of Δ n under (3) and that the function f increases strictly along any non-constant solution of (3) evolving on Δ n ; see, e.g., Refs. [16,30,31,32,36], and also Appendix B. Indeed, strictly speaking, the condition that is often assumed is that the gradient of f is locally Lipschitz continuous on some open superset of Δ n (see, e.g., Ref. [15]), which is a stronger assumption, as shown, e.g., in [35] (p. 400).
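To make this concrete, the following sketch integrates (3) with a plain forward-Euler step for a smooth objective $f(z) = z^\top A z$; the matrix $A$, the stepsize, and the horizon are illustrative choices of ours, not taken from the references. Since each step's increment sums to zero, the simplex constraint is preserved up to round-off, and $f$ increases along the discrete trajectory.

```python
import numpy as np

# Forward-Euler integration of the Shahshahani-type system (3) for the
# smooth objective f(z) = z^T A z (A symmetric, chosen only for illustration).
A = np.array([[1.0, 0.2, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.8]])

def grad_f(z):
    return 2.0 * A @ z                     # gradient of z^T A z for symmetric A

def euler_step(z, dt=1e-3):
    g = grad_f(z)
    return z + dt * z * (g - z @ g)        # z_i [d_i f(z) - sum_k z_k d_k f(z)]

z = np.array([0.5, 0.3, 0.2])              # interior starting point
f_start = z @ A @ z
for _ in range(20_000):                    # integrate up to t = 20
    z = euler_step(z)
f_end = z @ A @ z
```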
However, as we shall see in the following sections, we are interested in objective functions that can exhibit singularities on $\partial\Delta_n$, and so we cannot rely on the usual assumptions made in the literature. Crucially, if the function $f$ is differentiable in $\operatorname{int}(\Delta_n)$ but not at some points of $\partial\Delta_n$, then forward trajectories of (3) may converge to one of these points, where (3) is undefined, and this may happen even when additional hypotheses such as concavity are met, as in the example shown in Appendix C. By contrast, this cannot happen under the circumstances described in the next lemma, which we will apply in the sequel:
Lemma 1.
Let $f : \mathbb{R}^n \to \mathbb{R}$ admit a locally Lipschitz continuous gradient on some open $\Omega \subseteq \mathbb{R}^n$. Suppose $f$ is concave on some non-empty compact convex $K \subseteq \Omega \cap \Delta_n$. Let $z : [0, +\infty) \to K$ be a solution of (3).
(i)
There exists $z^* = \lim_{t \to +\infty} z(t) \in K$, and $z^*$ is a stationary point for (3);
(ii)
If $z(0) \in \operatorname{int}(\Delta_n)$, then $z^*$ is a KKT point for (2);
(iii)
If $z(0) \in \operatorname{int}(\Delta_n)$ and $f$ is concave and continuous on $\Delta_n$, then $z^*$ is a global solution of (2).
Proof. 
(i): We now make the following claim, whose technical proof is deferred to Appendix D:
Claim 1
(Convergence). There exists $z^* = \lim_{t \to +\infty} z(t) \in K$.
Note that $z^* \in K$ entails $z^* \in \Omega$; hence, $\nabla f$ is well defined and continuous in a neighborhood of $z^*$. A standard argument can now be applied to show that $z^*$ is stationary for (3)—namely, since the $\omega$-limit here does not escape $\Omega$, the singleton $\{z^*\}$ is forward invariant under (3) as a consequence of Gronwall's lemma—see, e.g., Ref. [16].
(ii): We will use the following claim, which provides a known alternative description, for maximization programs over the standard simplex, of the KKT conditions:
Claim 2
(KKT points). A point x * Δ n in which f is differentiable is a KKT point for (2) if and only if there exists some λ R such that
$$\partial_i f(x^*) \le \lambda \ \text{ for every } i \in [n], \ \text{ with equality for every } i \in \operatorname{supp}(x^*).$$
The interested reader may find a proof of Claim 2 in Appendix E. Assume $z(0) \in \operatorname{int}(\Delta_n)$, which entails $z(t) > 0$ for every $t > 0$ by uniqueness of local solutions—see, e.g., Proposition A1.(iii). By (i), the point $z^* = \lim_{t \to +\infty} z(t)$ exists in $\Omega$. Define $\mu(z) = \sum_{k=1}^n z_k \, \partial_k f(z)$ and set $\lambda = \mu(z^*)$. Since $z^*$ is a stationary point for (3), it follows that $\partial_i f(z^*) = \lambda$ for every $i \in \operatorname{supp}(z^*)$. The proof is completed by reductio ad absurdum as in the proof of [15] (Prop. 3.5). In fact, let $i \notin \operatorname{supp}(z^*)$—i.e., let $z_i^* = 0$—and suppose also that $\partial_i f(z^*) > \lambda = \mu(z^*)$. Then, by continuity, also $\partial_i f(y) > \mu(y)$ for every $y \in \Omega$ sufficiently close to $z^*$; hence, $\dot z_i(t) > 0$ for every $t > 0$ sufficiently large, which contradicts $z_i(t) \to 0^+$. It follows that $z^*$ is a KKT point for (2) by Claim 2.
(iii): If f is concave on Δ n , then every KKT point for (2) is a global solution for (2)—see, e.g., Ref. [37]; hence, (iii) follows by (ii). □

4. Problem Formulation

4.1. Discrete Memoryless Channel and Capacity

A discrete memoryless channel (DMC) is a communication system that can be described by a triplet $\mathcal{C} = (\mathcal{X}, \mathcal{Y}, P)$, in which $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$ and $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$ are finite alphabets called the input alphabet and output alphabet, respectively, whereas $P = [p(j|i)]_{i \in [n],\, j \in [m]} \in \mathbb{R}^{n \times m}$ is a stochastic matrix, called the transition matrix, where $p(j|i)$ expresses the probability that the symbol $y_j \in \mathcal{Y}$ is observed as output of the system whenever the symbol $x_i \in \mathcal{X}$ is sent to the system as input [3]. Without loss of generality, we will work under the following assumption:
For every $j \in [m]$, there exists at least one $i \in [n]$ such that $p(j|i) > 0$.
In other words, we are assuming that Y represents the minimal output alphabet required for a description of the DMC. In fact, this assumption ensures that for any selected symbol y Y there exists a corresponding input distribution for which y occurs as output with positive probability.
Let $X$ be the input variable of $\mathcal{C}$, i.e., the random variable with range in $\mathcal{X}$ modeling the input of the channel, and let $Y$ be the output variable of $\mathcal{C}$, i.e., the random variable modeling the output of the channel. Following [2], the mutual information between $X$ and $Y$, denoted by $I[X;Y]$, is a non-negative quantity measuring the reduction in uncertainty about $X$ that results from learning the value of $Y$, and its formulation involves the notion of entropy, which for a generic discrete random variable $D$ with range in a set $\mathcal{D}$ is given by $H[D] = -\sum_{d \in \mathcal{D}} p_D(d) \ln[p_D(d)]$. The random variable $Y$ and all the random variables $Y \mid X = x$ obtained as $x$ ranges over $\mathcal{X}$ appear, together with the input distribution $p_X$, in the following expression for $I[X;Y]$ (see, e.g., Ref. [2]):
$$I[X;Y] = H[Y] - \sum_{x \in \mathcal{X}} p_X(x) \, H[Y \mid X = x].$$
The capacity of C is the maximum value C of I [ X ; Y ] over all possible choices for p X :
$$C = \max_{p_X} \; I[X;Y].$$
The capacity C provides a theoretical bound for the information content that can be transmitted through the channel [1,2,3].

4.2. Optimization Program for the Capacity

Consider a channel $\mathcal{C} = (\mathcal{X}, \mathcal{Y}, P)$ and set $n = |\mathcal{X}|$ and $m = |\mathcal{Y}|$. The maximization problem associated with the capacity admits a well-known formulation as a constrained optimization program over a standard simplex—see, e.g., Ref. [10]. We now show how to derive this program and, in so doing, we take the opportunity to define some auxiliary functions and parameters that we will use extensively in this paper.
Define the function q = ( q 1 , , q m ) : R n R m , where q j is the linear function
$$z = (z_1, \dots, z_n) \mapsto q_j(z) = \sum_{i=1}^n p(j|i) \, z_i.$$
If z i = p X ( x i ) for every i [ n ] , then q j ( z ) = p Y ( y j ) for every j [ m ] by the theorem of total probability; thus, z and q = q ( z ) can be identified with the distributions of X and Y respectively, and it is possible to write I [ X ; Y ] as a function of z . Indeed, it is sufficient to set
$$c_i = \sum_{j=1}^m p(j|i) \ln[p(j|i)], \qquad i \in [n], \qquad (4)$$
define the function
$$I(z) = \sum_{i=1}^n c_i z_i - \sum_{j=1}^m q_j \ln q_j, \qquad (5)$$
and observe that, for z Δ n corresponding to p X , the equality I [ X ; Y ] = I ( z ) holds. As a result, the capacity C of the channel C satisfies
$$C = \max_{z \in \Delta_n} \; I(z). \qquad (6)$$
Following [10], we call optimal input distribution every global solution to (6), i.e., every z * Δ n such that I ( z * ) = C .
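The objective of program (6) is straightforward to evaluate numerically. The sketch below (our own helper, with illustrative names) computes $I(z)$ from the quantities just defined, and can be checked against the closed-form capacity of a binary symmetric channel, whose optimal input distribution is uniform.

```python
import numpy as np

def I_obj(z, P):
    """Objective I(z) of program (6), in nats:
    I(z) = sum_i c_i z_i - sum_j q_j ln q_j, with c_i = sum_j p(j|i) ln p(j|i)."""
    c = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0).sum(axis=1)
    q = z @ P                              # q_j(z) = sum_i p(j|i) z_i
    ent = np.where(q > 0, q * np.log(np.where(q > 0, q, 1.0)), 0.0).sum()
    return float(c @ z - ent)
```

For the binary symmetric channel with crossover probability $0.1$, the uniform input attains $\ln 2 + 0.1\ln 0.1 + 0.9\ln 0.9$ nats, while a deterministic input yields zero mutual information.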

5. Vector Flow for Capacity Computation

5.1. Flow Definition and Its Properties

The function I , which is the objective function of (6), is continuous and concave on Δ n —see, e.g., Ref. [3]. Define the open set
$$\Omega = \{ z \in \mathbb{R}^n \mid q_j(z) > 0 \ \text{for all } j \in [m] \},$$
and note that $\operatorname{int}(\Delta_n) \subseteq \Omega$.
Proposition 1.
Let $z \in \Omega$. Then,
(a)
$\partial_k I(z) = c_k - 1 - \sum_{j=1}^m p(j|k) \ln[q_j(z)]$;
(b)
$\sum_{k=1}^n z_k \, \partial_k I(z) = I(z) - \sum_{k=1}^n z_k$;
(c)
$\partial^2_{i,k} I(z) = -\sum_{j=1}^m p(j|i) \, p(j|k) \, [q_j(z)]^{-1}$.
Proof. 
See Appendix F. □
In particular, $\nabla I$ is well defined and continuous on $\Omega$, and so $\nabla I$ is locally Lipschitz continuous on $\Omega$. However, note that $\nabla I$ is singular on $\partial\Omega$.
Now, consider the ODE
$$\dot z_i = z_i \left[ \partial_i I(z) - \sum_{k=1}^n z_k \, \partial_k I(z) \right], \qquad i \in [n]. \qquad (7)$$
We will also refer to Equation (7) as the capacity computing ODE (CC-ODE for short).
Theorem 1
(Forward invariance). The set $\Omega \cap \Delta_n$ is forward invariant under (7).
Proof. 
Let
$$Y_0 = \{ j \in [m] \mid p(j|i) = 0 \ \text{for some } i \in [n] \}.$$
It is easy to see that for every $j \in [m]$ and every $z \in \Delta_n$:
  • If $j \notin Y_0$, then $q_j(z) \ge \min_{i \in [n]} p(j|i) > 0$;
  • If $j \in Y_0$, then $q_j(z) = 0$ if and only if $\operatorname{supp}(z) \cap S_j = \emptyset$, where the set $S_j \subseteq [n]$ is defined by $S_j = \{ i \in [n] \mid p(j|i) \neq 0 \}$.
For every $j \in Y_0$, we define the following objects:
  • The vector $b(j) = \sum_{i=1}^n p(j|i) \, e_i$;
  • The function $\beta_j : \Omega \to \mathbb{R}$ given by
    $$\beta_j(z) = \left\langle b(j), \left[ \operatorname{diag}(z) - z z^\top \right] \nabla I(z) \right\rangle;$$
  • The real number $\varepsilon_j$ as a positive solution to the system of inequalities
    $$c_i - p(j|i) \ln \varepsilon_j - C > 0 \quad \text{for all } i \in S_j, \qquad (9)$$
    where $c_i$ and $C$ are defined as in (4) and (6) (note that such an $\varepsilon_j$ exists, since $-\ln u \to +\infty$ as $u \to 0^+$ and $p(j|i) \neq 0$ for $i \in S_j$).
We make the following claim:
Claim 3.
Let $z \in \Omega \cap \Delta_n$ and let $j \in Y_0$. Then,
$$0 < q_j(z) \le \varepsilon_j \implies \beta_j(z) \ge 0. \qquad (10)$$
Proof of Claim 3.
By Proposition 1, the definition of $S_j$, and using that $\sum_{i=1}^n z_i = 1$,
$$\beta_j(z) = \sum_{i=1}^n p(j|i) \, z_i \left[ \partial_i I(z) - \sum_{k=1}^n z_k \, \partial_k I(z) \right] = \sum_{i=1}^n p(j|i) \, z_i \left[ c_i - \sum_{J=1}^m p(J|i) \ln[q_J(z)] - I(z) \right] = \sum_{i \in S_j} p(j|i) \, z_i \left[ c_i - \sum_{J=1}^m p(J|i) \ln[q_J(z)] - I(z) \right],$$
and so
$$\beta_j(z) \ge \sum_{i \in S_j} p(j|i) \, z_i \left[ c_i - p(j|i) \ln[q_j(z)] - C \right], \qquad (11)$$
since $z \in \Delta_n$ entails both $I(z) \le C$ and $\ln[q_J(z)] \le 0$ for every $J$. If $q_j(z) \le \varepsilon_j$, then $\ln q_j(z) \le \ln \varepsilon_j$, and so, by (9) and (11), it follows that $\beta_j(z) \ge 0$. □
Now, let $y \in \Omega \cap \Delta_n$. We have to prove that a solution $z : [0, +\infty) \to \Omega \cap \Delta_n$ of (7) exists satisfying $z(0) = y$. We first make the following claim:
Claim 4. 
There exists a convex compact $K \subseteq \Omega \cap \Delta_n$ such that $y \in K$, and $K$ is forward invariant under (7).
Proof of Claim 4.
For every $j \in Y_0$, set $\alpha_j = \min\{\varepsilon_j, q_j(y)\}$ and define
$$K = \{ z \in \Delta_n \mid q_j(z) \ge \alpha_j \ \text{for every } j \in Y_0 \}.$$
By construction, $y \in K \subseteq \Omega \cap \Delta_n$. Note that $K$ is a compact and convex polytope containing the elements $z \in \mathbb{R}^n$ that satisfy the following constraints:
  • $\langle e_i, z\rangle \ge 0$ for every $i \in [n]$;
  • $\langle \mathbf{1}_n, z\rangle = 1$;
  • $\langle b(j), z\rangle \ge \alpha_j$ for every $j \in Y_0$.
Therefore, for every $z \in K$, the tangent cone $T_K(z)$—see Appendix A or, e.g., Ref. [16]—is the set of $u \in \mathbb{R}^n$ such that
(a)
$u_i \ge 0$ for every $i \notin \operatorname{supp}(z)$;
(b)
$\sum_{i=1}^n u_i = 0$;
(c)
For every $j \in Y_0$, if $\langle b(j), z\rangle = \alpha_j$ then $\langle b(j), u\rangle \ge 0$.
Note now that $[\operatorname{diag}(z) - z z^\top] \nabla I(z) \in T_K(z)$ for every $z \in K$. In fact, (a) and (b) are immediate to check, whereas (c) holds by Claim 3. The thesis follows by Theorem A1 in Appendix A. □
By Claim 4, there exists a compact $K \subseteq \Omega \cap \Delta_n$ and a solution $z : [0, +\infty) \to K \subseteq \Omega \cap \Delta_n$ of (7) satisfying $z(0) = y$. □
By Theorem 1, we can define a continuous-time dynamical system on $\Omega \cap \Delta_n$ via the (forward) flow generated on $\Omega \cap \Delta_n$ by the CC-ODE, i.e., via the function
$$\varphi_I : (\Omega \cap \Delta_n) \times [0, +\infty) \to \Omega \cap \Delta_n$$
such that for every $y \in \Omega \cap \Delta_n$, the function $z = \varphi_I(y, \cdot)$ is the unique solution of
$$\begin{cases} \dot z = [\operatorname{diag}(z) - z z^\top] \nabla I(z), & t \ge 0, \\ z(0) = y. \end{cases}$$
Proposition 2
(CC-ODE flow properties). Let $y \in \Omega \cap \Delta_n$ and set $z = \varphi_I(y, \cdot)$.
(a)
The equality $\operatorname{supp}(z(t)) = \operatorname{supp}(y)$ holds for every $t \ge 0$;
(b)
Either $y$ is a stationary point for (7) or $I \circ z$ is strictly increasing;
(c)
There exists $z^* = \lim_{t \to +\infty} z(t) \in \Omega \cap \Delta_n$, and $z^*$ is a stationary point for (7);
(d)
The restriction of $I$ to the set $\Gamma = \{ x \in \Delta_n \mid \operatorname{supp}(x) \subseteq \operatorname{supp}(y) \}$ attains its maximum in $z^*$:
$$z^* \in \arg\max_{z \in \Gamma} I(z).$$
Proof. 
(a), (b): These properties hold for Shahshahani gradient systems trajectories, and similar proofs are valid in our setting by the uniqueness of local solutions of (7). For more details, see Appendix B.
(c): This follows from Claim 4 and Lemma 1.(i).
(d): Assume without loss of generality that only the first $\tilde n$ coordinates of $y$ are positive, where $0 < \tilde n \le n$. Consider the injection $\iota : \mathbb{R}^{\tilde n} \to \mathbb{R}^n$ given by $\iota(w) = (w, 0)$, and define the function $\tilde I = I \circ \iota$. Observe that $z = \iota \circ x$ by uniqueness of solutions, where $x : [0, +\infty) \to \Delta_{\tilde n}$ solves
$$\begin{cases} \dot x = [\operatorname{diag}(x) - x x^\top] \nabla \tilde I(x), & t \ge 0, \\ x(0) = \iota^{-1}(y). \end{cases}$$
Since $\Gamma = \iota(\Delta_{\tilde n})$ and $\iota^{-1}(y) \in \operatorname{int}(\Delta_{\tilde n})$ by construction, the result follows by Lemma 1 applied to $f = \tilde I$. □
In particular, Proposition 2 gives the following fundamental theorem:
Theorem 2
(CC-ODE flow-attaining capacity). Let $y \in \operatorname{int}(\Delta_n)$. Then $\lim_{t \to +\infty} \varphi_I(y, t)$ exists and is an optimal input distribution. Either $\varphi_I(y, \cdot) \equiv y$ is a constant function, or $I(\varphi_I(y, t))$ is strictly increasing in $t$ and converges to $C$ as $t \to +\infty$.
Proof. 
The proof follows directly from Proposition 2. □

5.2. Connection with Blahut–Arimoto Algorithm

The BAA is an iterative algorithm that can be described via a map Φ BA : Δ n Δ n , here called the Blahut–Arimoto map. Given an initial input distribution z ( 0 ) int ( Δ n ) , Blahut [6] and Arimoto [5] proved that the sequence { z ( k ) } k defined by z ( k + 1 ) = Φ BA ( z ( k ) ) satisfies I ( z ( k ) ) C as k . It has been observed [8] that Φ BA acts according to the MWU rule  [13]:
$$z_i^{(k+1)} = \frac{z_i^{(k)} \exp[w_i(z^{(k)})]}{\sum_{\ell=1}^n z_\ell^{(k)} \exp[w_\ell(z^{(k)})]}, \qquad (15)$$
where the weights w 1 ( z ) , …, w n ( z ) satisfy
$$w_i(z) = \sum_{j \in [m] :\, p(j|i) \neq 0} p(j|i) \ln \frac{p(j|i)}{q_j(z)}. \qquad (16)$$
Interestingly, there exists a numerical scheme used for approximating Shahshahani gradient systems that is also based on MWU, which is the following:
$$z_i^{(k+1)} = \frac{z_i^{(k)} \exp[\tau \, \partial_i f(z^{(k)})]}{\sum_{\ell=1}^n z_\ell^{(k)} \exp[\tau \, \partial_\ell f(z^{(k)})]}, \qquad (17)$$
where $\tau > 0$ is the stepsize and $f$ must be sufficiently smooth. The recurrence described in (17) is well known in evolutionary game theory—see, e.g., the deduction of the discrete replicator-dynamics model [15]. One fundamental property of (17) is that if $z^{(k)}$ is an element of $\Delta_n$ and the gradient of $f$ is defined in $z^{(k)}$, then (17) defines $z^{(k+1)}$ as an element of $\Delta_n$ having the same support as $z^{(k)}$, in agreement with the support invariance of the continuous-time dynamics—see also Appendix B. In particular, if $z^{(0)} \in \operatorname{int}(\Delta_n)$ and $f$ is differentiable in $\operatorname{int}(\Delta_n)$, then the recurrence given by (17) can be used to produce a sequence $\{z^{(k)}\}_k$ of arbitrary length (even though, in numerical implementations, the recurrence initialized in $\operatorname{int}(\Delta_n)$ could generate points on $\partial\Delta_n$ due to floating-point arithmetic; see also Ref. [29]).
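A minimal sketch of the update (17), with the support-preservation property made explicit (function and variable names are ours, for illustration only):

```python
import numpy as np

def mwu_step(z, grad, tau=1.0):
    """One multiplicative-weight update (17): z_i <- z_i * exp(tau * grad_i),
    renormalized. Coordinates with z_i = 0 stay at 0, so supp(z) is preserved."""
    w = z * np.exp(tau * grad)
    return w / w.sum()
```

Note that a zero coordinate remains zero regardless of how large its gradient entry is, mirroring the support invariance of the continuous-time dynamics.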
We can now explain in the following theorem how the CC-ODE flow can be regarded as a continuous-time version of the BAA:
Theorem 3.
The Blahut–Arimoto map coincides with the MWU rule (17) applied to $f = I$ defined as in (5), with stepsize $\tau = 1$.
Proof. 
Observe that
$$\partial_i I(z) = -1 + w_i(z),$$
with w i ( z ) defined as in (16); hence, z ( k + 1 ) = Φ BA ( z ( k ) ) satisfies the following:
$$z_i^{(k+1)} = \frac{z_i^{(k)} \exp[\partial_i I(z^{(k)}) + 1]}{\sum_{\ell=1}^n z_\ell^{(k)} \exp[\partial_\ell I(z^{(k)}) + 1]} = \frac{z_i^{(k)} \exp[\partial_i I(z^{(k)})]}{\sum_{\ell=1}^n z_\ell^{(k)} \exp[\partial_\ell I(z^{(k)})]}. \ \square$$
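The identity underlying Theorem 3 is easy to verify numerically: for a randomly generated channel with strictly positive transition matrix, the gradient entries $\partial_i I(z)$ differ from the BAA weights $w_i(z)$ by exactly $1$, so a BAA step and an MWU step with $\tau = 1$ coincide. The sketch below (our own check, not from the paper) does this.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4)); P /= P.sum(axis=1, keepdims=True)   # random positive transition matrix
z = rng.random(3); z /= z.sum()                              # random interior input distribution

c = (P * np.log(P)).sum(axis=1)                              # c_i as in (4)
q = z @ P                                                    # output distribution q(z)
grad_I = c - 1.0 - (P * np.log(q)).sum(axis=1)               # gradient entries of I
w = (P * np.log(P / q)).sum(axis=1)                          # BAA weights (16)

assert np.allclose(grad_I, w - 1.0)          # the identity used in the proof

baa = z * np.exp(w); baa /= baa.sum()        # one Blahut-Arimoto step
mwu = z * np.exp(grad_I); mwu /= mwu.sum()   # MWU (17) with tau = 1, f = I
assert np.allclose(baa, mwu)                 # Theorem 3, numerically
```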

6. Convergence Rate

6.1. Conditions for Exponential Convergence

Consider an optimal input distribution z * . We now present some conditions that ensure that the flow converges exponentially to z * . To this end, we will study a first-order approximation of
$$v(z) = [\operatorname{diag}(z) - z z^\top] \nabla I(z) \qquad (19)$$
in the vicinity of $z^*$. In our continuous-time scenario, this corresponds to what is examined in [9], where a truncated Taylor expansion of $\Phi_{\mathrm{BA}}(z)$ in $z^*$ is studied to investigate the rate of convergence to $z^*$ of the sequence given by $z^{(k+1)} = \Phi_{\mathrm{BA}}(z^{(k)})$, where $z^{(0)} \in \operatorname{int}(\Delta_n)$. To this end, we consider a classification over $[n]$ introduced in [9] that involves the coordinates of $z^*$ and $\nabla I(z^*)$. Specifically, an index $i \in [n]$ is classified as follows (note that the case $\partial_i I(z^*) > C - 1$ is excluded by the KKT conditions):
  • type I if $i \in \operatorname{supp}(z^*)$ and $\partial_i I(z^*) = C - 1$;
  • type II if $i \notin \operatorname{supp}(z^*)$ and $\partial_i I(z^*) = C - 1$;
  • type III if $i \notin \operatorname{supp}(z^*)$ and $\partial_i I(z^*) < C - 1$.
In particular, we remark that the vector $\mu \in \mathbb{R}^n$ given by
$$\mu = (\mu_1, \dots, \mu_n)^\top = \sum_{i=1}^n \left[ C - 1 - \partial_i I(z^*) \right] e_i$$
satisfies μ i > 0 if i is of type III , and μ i = 0 otherwise. As shown in the following proposition, this classification helps in the study of the first-order approximation of (19) near z * :
Proposition 3.
Let $i \in [n]$. Then,
$$\nabla v_i(z^*) = \begin{cases} z_i^* \left[ \nabla \partial_i I(z^*) - \nabla I(z^*) + \mathbf{1}_n \right] & \text{if } i \text{ is of type I}, \\ 0 & \text{if } i \text{ is of type II}, \\ \left[ \partial_i I(z^*) - I(z^*) + 1 \right] e_i & \text{if } i \text{ is of type III}. \end{cases}$$
Proof. 
By Proposition 1,
$$\nabla v_i(z) = \nabla \left( z_i \left[ \partial_i I(z) - I(z) + \sum_{k=1}^n z_k \right] \right) = \left[ \partial_i I(z) - I(z) + \sum_{k=1}^n z_k \right] e_i + z_i \left[ \nabla \partial_i I(z) - \nabla I(z) + \mathbf{1}_n \right],$$
and note that I ( z * ) = C . □
By Proposition 3, if i is of type II , then v i has a null gradient, and higher-order terms would be required to study v i in the vicinity of z * . Indeed, in order to obtain a non-singular linearization of (7) in a neighborhood of z * , we will make the following assumption:
(N1)
For some positive $n' \le n$, the indices $i = 1$, …, $n'$ are of type I and the remaining $n'' = n - n'$ indices are of type III.
Similarly to what is done in [9], it is useful to introduce some additional notation to distinguish between indices of different types. To this end, for every $x \in \mathbb{R}^n$, define $P_{\mathrm{I}} x = (x_1, \dots, x_{n'}) \in \mathbb{R}^{n'}$ and $P_{\mathrm{III}} x = (x_{n'+1}, \dots, x_n) \in \mathbb{R}^{n''}$, so that $x = (P_{\mathrm{I}} x, P_{\mathrm{III}} x)$. Note that $z^* = (P_{\mathrm{I}} z^*, 0)$ and $\mu = (0, P_{\mathrm{III}} \mu)$. Moreover, let $H$ be the Hessian of $I$ in $z^*$, and write the submatrix of $H$ containing its first $n'$ rows as $[H_{\mathrm{I,I}} \,|\, H_{\mathrm{I,III}}]$, where $H_{\mathrm{I,I}} \in \mathbb{R}^{n' \times n'}$ and $H_{\mathrm{I,III}} \in \mathbb{R}^{n' \times n''}$.
Theorem 4.
Assume (N1). Define $Z = \operatorname{diag}(P_{\mathrm{I}} z^*) \in \mathbb{R}^{n' \times n'}$, let $D = \operatorname{diag}(P_{\mathrm{III}} \mu) \in \mathbb{R}^{n'' \times n''}$, and let $J \in \mathbb{R}^{n \times n}$ be the Jacobian matrix of $v$ in $z^*$. Then $J(z - z^*) = M(z - z^*)$ for every $z \in \Delta_n$, where $M \in \mathbb{R}^{n \times n}$ is the upper-triangular block matrix
$$M = \begin{bmatrix} Z H_{\mathrm{I,I}} & Z H_{\mathrm{I,III}} + (P_{\mathrm{I}} z^*) (P_{\mathrm{III}} \mu)^\top \\ 0 & -D \end{bmatrix}. \qquad (21)$$
Proof. 
For every $i \in [n]$, the $i$-th row of $J$ is precisely $\nabla v_i(z^*)^\top$. Apply Proposition 3, using the fact that $\sum_{i=1}^n (z_i - z_i^*) = 1 - 1 = 0$. □
Before proving our main result on the convergence rate of the CC-ODE flow, we need the following additional assumption:
(N2)
The first $n'$ rows of the transition matrix $P$ are linearly independent.
Theorem 5
(Exponential convergence rate). Suppose (N1) and (N2) hold. Then the maximum eigenvalue $\lambda_{\max}$ of $M$ is negative. Moreover, $z^*$ is the only optimal input distribution for the channel $\mathcal{C}$, and for every $0 < \alpha < |\lambda_{\max}|$, there exist $\delta, K > 0$ such that for every $y \in \Omega \cap \Delta_n$, if $\|y - z^*\| < \delta$, then
$$\| \varphi_I(y, t) - z^* \| \le K \, \|y - z^*\| \exp(-\alpha t) \qquad (22)$$
for every $t \ge 0$.
Proof. 
Consider first the matrix $M$ defined in (21), whose eigenvalues are the union, counting multiplicity, of the eigenvalues of $Z H_{\mathrm{I,I}}$ and those of $-D$. The eigenvalues of $-D$ are $-\mu_{n'+1}$, …, $-\mu_n$, whereas those of $Z H_{\mathrm{I,I}}$ are described in the following lemma, proved in [9] (Section III.C):
Lemma 2
(Nakagawa et al. [9]). The matrix $Z H_{\mathrm{I,I}}$ is diagonalizable, has eigenvalues $-1 \le \lambda_1 \le \dots \le \lambda_{n'} \le 0$, and $\lambda_{n'} < 0$ if and only if the first $n'$ rows of the transition matrix $P$ are linearly independent.
By Lemma 2 and Assumption (N2), it follows that λ max < 0 .
The remaining part of the proof relies on a standard application of Lyapunov's stability theory (see, e.g., Refs. [17,38]), combined with the application of Theorem 4. Given $0 < \alpha < |\lambda_{\max}|$, the matrix $\tilde M = M + \alpha I$ has only real negative eigenvalues by construction. Consequently, the Lyapunov equation [17]
$$\tilde M^\top B + B \tilde M = -I \qquad (23)$$
is solved by a positive definite symmetric matrix $B \in \mathbb{R}^{n \times n}$. Consider the quadratic form $V(x) = \langle x, Bx\rangle$. Calling the minimum and the maximum eigenvalue of $B$ respectively $a$ and $A$, it follows that $0 < a \le A$ and
$$a \|x\|^2 \le V(x) \le A \|x\|^2. \qquad (24)$$
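The Lyapunov equation above can be solved constructively by vectorization: applying $\operatorname{vec}$ turns it into the linear system $(\tilde M^\top \otimes I + I \otimes \tilde M^\top)\operatorname{vec}(B) = -\operatorname{vec}(I)$. The sketch below does this for an illustrative stable matrix of our own choosing (in practice one could equally use `scipy.linalg.solve_continuous_lyapunov`).

```python
import numpy as np

def lyapunov_solve(Mt):
    """Solve  Mt^T B + B Mt = -I  by vectorization; for a Hurwitz Mt the
    solution B is symmetric positive definite."""
    n = Mt.shape[0]
    A = np.kron(np.eye(n), Mt.T) + np.kron(Mt.T, np.eye(n))
    B = np.linalg.solve(A, -np.eye(n).reshape(-1)).reshape(n, n)
    return 0.5 * (B + B.T)                 # symmetrize against round-off

# Illustrative stable matrix (all eigenvalues negative), not taken from the paper:
Mt = np.array([[-2.0, 1.0],
               [0.0, -1.0]])
B = lyapunov_solve(Mt)
```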
Recall that
$$v(z) = J(z - z^*) + r(z - z^*), \qquad (25)$$
where J is the Jacobian matrix of v in z * and
$$\frac{\| r(z - z^*) \|}{\| z - z^* \|} \to 0 \qquad \text{as} \qquad z \to z^*.$$
Therefore, setting $z(t) = \varphi_I(y, t)$ and performing the substitution $\eta(t) = z(t) - z^*$ yields
$$\frac{d}{dt} V(\eta) = 2 \langle v(z), B\eta\rangle = 2 \langle \eta, B J \eta\rangle + 2 \langle r(\eta), B\eta\rangle = 2 \langle \eta, B M \eta\rangle + 2 \langle r(\eta), B\eta\rangle,$$
where the last equality is a consequence of Theorem 4. By definition of $\tilde M$, by (23), and using that $B^\top = B$,
$$2 \langle \eta, B M \eta\rangle = 2 \langle \eta, B \tilde M \eta\rangle - 2\alpha \langle \eta, B\eta\rangle = \langle \eta, (\tilde M^\top B + B \tilde M) \eta\rangle - 2\alpha V(\eta) = -\|\eta\|^2 - 2\alpha V(\eta).$$
Hence,
$$\frac{d}{dt} V(\eta) = -\|\eta\|^2 - 2\alpha V(\eta) + 2 \langle r(\eta), B\eta\rangle = -\left[ 1 - 2 \left\langle \frac{r(\eta)}{\|\eta\|}, B \frac{\eta}{\|\eta\|} \right\rangle \right] \|\eta\|^2 - 2\alpha V(\eta).$$
Suppose now that for some T > 0 , the expression
$$1 - 2 \left\langle \frac{r(\eta)}{\|\eta\|}, B \frac{\eta}{\|\eta\|} \right\rangle \qquad (26)$$
is positive for every $t \in [0, T)$. Then, the inequality $\frac{d}{dt} V(\eta) \le -2\alpha V(\eta)$ holds on $[0, T)$. By Gronwall's lemma [17], this implies that $V(\eta(t)) \le V(\eta(0)) \exp(-2\alpha t)$, and so by (24)
$$\|\eta(t)\| \le \sqrt{\frac{A}{a}} \, \|\eta(0)\| \exp(-\alpha t) \qquad (27)$$
for every $t \in [0, T)$. What is left is proving that for $\delta > 0$ sufficiently small, if $\|y - z^*\| = \|\eta(0)\| < \delta$, then (26) is positive for every $t \ge 0$. This follows easily from (25) and (27).
Finally, to prove that $z^*$ is the unique optimal input distribution, suppose, for the sake of contradiction, that there exists an optimal input distribution $y^* \neq z^*$. Note that, by concavity of $I$, the set of optimal input distributions is a convex subset of $\Delta_n$. Then every convex combination of $y^*$ and $z^*$ is an optimal input distribution. In particular, infinitely many stationary points exist whose distance from $z^*$ does not exceed $\delta$, and this contradicts the definition of $\delta$. □
Theorem 5 is a continuous-time counterpart of the following result reported in [9]:
Theorem 6
(Nakagawa et al. [9]). Suppose (N1) and (N2) hold, and that there exists a unique optimal input distribution $z^*$ for the channel $\mathcal{C}$. Define $\vartheta_{\max}$ as the maximum of the set $\{\, 1 + \lambda_i \mid \lambda_i \in \sigma(ZH) \,\} \cup \{\, e^{\mu_i} \mid i = n' + 1, \dots, n \,\}$. Then for every $\vartheta$ with $\vartheta_{\max} < \vartheta < 1$, there exist $\delta, K > 0$ such that for every $y \in \operatorname{int}(\Delta_n)$, if $\|y - z^*\| < \delta$, then
$$\|\Phi_{\mathrm{BA}}^N(y) - z^*\| \le K \vartheta^N \|y - z^*\|.$$
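For reference, the Blahut–Arimoto map $\Phi_{\mathrm{BA}}$ whose convergence rate Theorem 6 quantifies can be sketched in a few lines of plain Python; the transition matrix and the optimal input distribution used below are those reported for $P^{(1)}$ in Section 8 (the fixed 2000-iteration budget is an arbitrary choice for illustration):

```python
import math

def baa_step(z, P):
    """One application of the Blahut-Arimoto map Phi_BA."""
    n, m = len(P), len(P[0])
    q = [sum(z[i] * P[i][j] for i in range(n)) for j in range(m)]
    # D_i = sum_j p(j|i) ln( p(j|i) / q_j ), with the convention 0 ln 0 = 0.
    D = [sum(P[i][j] * math.log(P[i][j] / q[j]) for j in range(m) if P[i][j] > 0)
         for i in range(n)]
    w = [z[i] * math.exp(D[i]) for i in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

# Channel P(1) from Section 8.
P1 = [[0.70, 0.20, 0.10],
      [0.10, 0.80, 0.10],
      [0.25, 0.25, 0.50]]

z = [1/3, 1/3, 1/3]
for _ in range(2000):
    z = baa_step(z, P1)

# The iteration converges to the unique optimal input distribution
# reported in Section 8.
target = (0.3645, 0.4169, 0.2186)
assert all(abs(zi - ti) < 1e-3 for zi, ti in zip(z, target))
```

Since the contraction factor $\vartheta$ is bounded away from 1 for this channel, a few hundred iterations already reach machine precision.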

6.2. Noiseless Symmetric Channels

In this section, we refine the result given in Theorem 5 under the additional hypothesis that C is a noiseless symmetric channel. Following [18] (pp. 77–78), we recall that the channel C = ( X , Y , P ) is called
  • Deterministic if the output Y is a deterministic function of the input X.
  • Lossless if the input X is completely determined by the output Y.
  • Noiseless if it is both deterministic and lossless.
  • Symmetric if in the transition matrix P every row is a permutation of every other row, and every column is a permutation of every other column.
We now assume that $\mathcal{C}$ is noiseless and symmetric. Then $|\mathcal{Y}| = |\mathcal{X}| = n$ and, up to a suitable reordering of the output alphabet, we may assume that the transition matrix is the $n \times n$ identity matrix. Note that $q = z$ in this case, and so $\Omega = (0, +\infty)^n$, the objective function is
$$I(z) = -\sum_{i=1}^{n} z_i \ln z_i,$$
and the CC-ODE is
$$\dot{z}_i = z_i \left[ -\ln z_i - I(z) - 1 + \sum_{k=1}^{n} z_k \right], \qquad i \in [n],$$
which on $\Delta_n$ reduces to $\dot{z}_i = z_i [ -\ln z_i - I(z) ]$.
The channel admits a unique optimal input distribution, which is $b = n^{-1} \mathbf{1}_n$, the barycenter of $\Delta_n$; thus $C = I(b) = \ln n$; see, e.g., Ref. [18] (pp. 84–85). In addition, it is easy to verify an interesting property of noiseless symmetric channels, namely that, for these channels, the BAA requires at most one iteration to find the optimal input distribution. By Theorem 2, we deduce that, for $y \in \operatorname{int}(\Delta_n) \setminus \{b\}$, we have $\varphi_I(y, t) \to b$ as $t \to +\infty$, and also $\varphi_I(b, t) = b$. We will see that a more precise asymptotic estimate on $\varphi_I$ is available by leveraging the explicit solution of (29) on $\operatorname{int}(\Delta_n)$.
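The one-iteration property is immediate to check: for the identity transition matrix we have $q = z$, hence $D_i = \sum_j p(j|i) \ln[p(j|i)/q_j] = -\ln z_i$, and the Blahut–Arimoto update $z_i e^{D_i} / \sum_k z_k e^{D_k}$ collapses to the uniform distribution in a single step. A minimal Python check (the exponential form of the update is the standard one; $n = 3$ is hardcoded purely for illustration):

```python
import math

def baa_step_identity(z):
    # For the identity channel q = z, hence D_i = -ln z_i and
    # z_i * exp(D_i) = 1 for every i: one step lands on the uniform point.
    w = [zi * math.exp(-math.log(zi)) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]

z = baa_step_identity([0.700, 0.200, 0.100])
assert all(abs(zi - 1/3) < 1e-12 for zi in z)
```

Any interior starting point gives the same result, since each unnormalized weight $z_i e^{-\ln z_i}$ equals 1 identically.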
Theorem 7.
If $\mathcal{C}$ is the noiseless symmetric channel, then, for every $y \in \operatorname{int}(\Delta_n) \setminus \{b\}$,
$$\|\varphi_I(y, t) - b\| \sim e^{-t}\, \mathrm{MSQ}(\ln y_1, \dots, \ln y_n)^{1/2}\, n^{-1/2}$$
as $t \to +\infty$.
Proof. 
Given $y \in \operatorname{int}(\Delta_n)$, the function $z(t) = \varphi_I(y, t)$ is given by the following explicit analytical expression:
$$z_i(t) = \frac{\exp(e^{-t} \ln y_i)}{\sum_{k=1}^{n} \exp(e^{-t} \ln y_k)}.$$
Note that $e^{-t} \to 0$ as $t \to +\infty$; hence, using the Maclaurin expansion of the exponential function,
$$z_i - n^{-1} = \frac{n \exp(e^{-t} \ln y_i) - \sum_{k=1}^{n} \exp(e^{-t} \ln y_k)}{n \sum_{k=1}^{n} \exp(e^{-t} \ln y_k)} = \frac{n (1 + e^{-t} \ln y_i) - \sum_{k=1}^{n} (1 + e^{-t} \ln y_k) + o(e^{-t})}{n \sum_{k=1}^{n} 1 + o(1)} = \frac{n \ln y_i - \sum_{k=1}^{n} \ln y_k + o(1)}{n^2 + o(1)}\, e^{-t}.$$
Therefore,
$$n^2 e^{2t} \|z(t) - b\|^2 = \sum_{i=1}^{n} \left[ n e^{t} (z_i - n^{-1}) \right]^2 \longrightarrow \sum_{i=1}^{n} \left( \ln y_i - n^{-1} \sum_{k=1}^{n} \ln y_k \right)^2 = n\, \mathrm{MSQ}(\ln y_1, \dots, \ln y_n),$$
which gives (30). □
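Both the explicit solution (31) and the asymptotic rate of Theorem 7 are easy to probe numerically. The sketch below (plain Python) evaluates $z_i(t) = \exp(e^{-t} \ln y_i) / \sum_k \exp(e^{-t} \ln y_k)$ and checks that $e^{t} \sqrt{n}\, \|z(t) - b\|$ approaches $\mathrm{MSQ}(\ln y_1, \dots, \ln y_n)^{1/2}$; here MSQ is read as the mean squared deviation from the mean, which is our interpretation of the notation:

```python
import math

def flow(y, t):
    """Explicit solution of the CC-ODE for the noiseless symmetric channel."""
    w = [math.exp(math.exp(-t) * math.log(yi)) for yi in y]
    s = sum(w)
    return [wi / s for wi in w]

y = (0.700, 0.200, 0.100)
n = len(y)
logs = [math.log(yi) for yi in y]
mean = sum(logs) / n
msq = sum((l - mean) ** 2 for l in logs) / n   # mean squared deviation

t = 15.0
z = flow(y, t)
b = 1.0 / n
err = math.sqrt(sum((zi - b) ** 2 for zi in z))
# Theorem 7:  ||z(t) - b||  ~  e^{-t} * sqrt(MSQ / n)  as t -> +infinity.
lhs = math.exp(t) * math.sqrt(n) * err
assert abs(lhs - math.sqrt(msq)) < 1e-4
```

At $t = 15$ the first-order term of the expansion already dominates, so the rescaled error matches $\sqrt{\mathrm{MSQ}}$ to several digits while staying well above double-precision noise.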

7. Analog Implementation

An attractive feature of continuous-time methods is their amenability to being mapped onto hardware circuits [39,40]. Figure 1 depicts a circuit implementing the CC-ODE flow to compute analogically the capacity of a DMC.
The circuit requires $z_1(0), \dots, z_n(0)$ as input, which represent the coordinates of a point $z(0) \in \operatorname{int}(\Delta_n)$ at which a trajectory of (7) is initialized. Moreover, the circuit requires, for every $k \in [n]$, a module implementing $z \mapsto z_k\, \partial_k I(z)$, which we here treat as a black box. These modules encode the dependence of the circuit on the transition matrix $P$ of the DMC. We remark on the following feature of these modules:
Proposition 4.
For every $k \in [n]$, the map $\Omega \cap \Delta_n \ni z \mapsto z_k\, \partial_k I(z)$ can be extended to a continuous function defined on $\Delta_n$.
Proof. 
Consider the function $L : [0, 1] \to \mathbb{R}$ given by $L(u) = u \ln u$ for $0 < u \le 1$ and $L(0) = 0$, which is a continuous function on $[0, 1]$. Then, by Proposition 1, for every $z \in \Omega \cap \Delta_n$,
$$z_k\, \partial_k I(z) = z_k \left[ c_k - 1 - \sum_{j=1}^{m} p(j|k) \ln[q_j(z)] \right] = z_k (c_k - 1) - \sum_{j=1}^{m} \frac{z_k\, p(j|k)}{q_j(z)}\, L(q_j(z)),$$
and since $q_j(z) = \sum_{i=1}^{n} z_i\, p(j|i) \ge z_k\, p(j|k) \ge 0$, it follows that $z \mapsto z_k\, \partial_k I(z)$ is bounded and admits a continuous extension to $\Delta_n$, which is given by
$$z \mapsto z_k \left[ c_k - 1 - \sum_{j \in [m] :\, q_j(z) \neq 0} p(j|k) \ln[q_j(z)] \right].$$
 □
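A software analogue of the black-box module is straightforward. The sketch below implements the continuous extension of Proposition 4; here we take $c_k = \sum_j p(j|k) \ln p(j|k)$ with the convention $0 \ln 0 = 0$, an assumption consistent with Proposition 1, and check identity (b) of Proposition 1 at an interior point:

```python
import math

def module(z, P, k):
    """Continuous extension of z -> z_k * d_k I(z) from Proposition 4."""
    n, m = len(P), len(P[0])
    q = [sum(z[i] * P[i][j] for i in range(n)) for j in range(m)]
    # c_k = sum_j p(j|k) ln p(j|k), with the convention 0 ln 0 = 0.
    c_k = sum(P[k][j] * math.log(P[k][j]) for j in range(m) if P[k][j] > 0)
    # The terms with q_j(z) = 0 are skipped, as in the extension above.
    s = sum(P[k][j] * math.log(q[j]) for j in range(m) if q[j] > 0)
    return z[k] * (c_k - 1 - s)

P = [[0.70, 0.20, 0.10],
     [0.10, 0.80, 0.10],
     [0.25, 0.25, 0.50]]

# On the boundary of the simplex the module is still well defined...
z_boundary = [0.0, 0.6, 0.4]
val = module(z_boundary, P, 0)
assert val == 0.0          # z_0 = 0 forces the output to zero here

# ...and at interior points Proposition 1.(b) gives sum_k z_k d_k I = I(z) - 1.
z = [0.2, 0.5, 0.3]
q = [sum(z[i] * P[i][j] for i in range(3)) for j in range(3)]
I = sum(z[k] * P[k][j] * math.log(P[k][j] / q[j])
        for k in range(3) for j in range(3))
assert abs(sum(module(z, P, k) for k in range(3)) - (I - 1)) < 1e-12
```

The boundedness guaranteed by Proposition 4 is what makes this map, unlike $\partial_k I$ alone, safe to evaluate on all of $\Delta_n$.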
The circuit outputs $z = (z_1, \dots, z_n)$, which at time $t = 0$ coincides with $z(0)$, and thanks to the integrator elements appearing in the circuit [40], the recurrent design of the circuit evolves $z$ according to Volterra's equation [33]:
$$z_i(t) = z_i(0) + \int_0^t \left[ z_i(s)\, \partial_i I(z(s)) - z_i(s) \sum_{k=1}^{n} z_k(s)\, \partial_k I(z(s)) \right] ds.$$
Hence, $z(t) = \varphi_I(z(0), t)$ for every $t \ge 0$. By Theorem 2, it follows that $z(t)$ converges to an optimal input distribution as $t \to +\infty$, which is also a stationary point for the system. Moreover, if the channel admits a unique optimal input distribution, then Theorem 2 ensures that this optimal input distribution is also an asymptotically stable stationary point of the system. Besides that, the circuit computes $\sum_{k=1}^{n} z_k\, \partial_k I(z(t)) + 1$, which equals $I(z)$ by Proposition 1 and converges to the capacity by Theorem 2.
We stress that the circuit drawn in Figure 1 is supposed to work in an ideal setting, where no noise affects $z$. Indeed, the circuit relies on the property that $\Omega \cap \Delta_n$ is invariant under $\varphi_I$, as shown in Theorem 1. By contrast, Figure 2 shows a variation of the circuit design that may mitigate the effect of small perturbations on $z$ thanks to an additional normalization module. The normalization module sets every negative signal received as input to zero, and then normalizes the resulting vector of non-negative signals with respect to the $\ell^1$-norm.
As long as $z$ stays in $\Omega \cap \Delta_n$, as should theoretically happen whenever $z(0)$ also lies in $\Omega \cap \Delta_n$, the additional normalization module has no effect on the system, and the circuits in Figure 1 and Figure 2 behave in the same way. However, in the presence of a perturbation that affects the system, the normalization module prevents $z$ from exiting $\Delta_n$. In addition, if the modules $z \mapsto z_k\, \partial_k I(z)$ are devised to implement the corresponding continuous extension described in Proposition 4, then by Peano's existence theorem [33], the ODE actually solved by the circuit is well defined on $\Delta_n$, and not just on $\Omega \cap \Delta_n$, even though some trajectories may lead to a suboptimal input distribution, as explained in Section 9.
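The normalization module admits a direct software counterpart, sketched here in Python: clip negative signals to zero, then renormalize in the $\ell^1$-norm. We assume at least one entry stays positive after clipping; how a physical circuit should handle an all-zero input is not specified in the text.

```python
def normalize(signals):
    """Clip negative entries to zero, then renormalize in the l1-norm.

    Assumes at least one entry remains positive after clipping."""
    clipped = [max(s, 0.0) for s in signals]
    total = sum(clipped)
    return [c / total for c in clipped]

# A small perturbation pushed z slightly outside the simplex...
z_perturbed = [0.52, -0.01, 0.50]
z = normalize(z_perturbed)
# ...and the module maps it back onto Delta_n.
assert all(zi >= 0.0 for zi in z)
assert abs(sum(z) - 1.0) < 1e-12
```

Note that clipping may change the support of $z$, which is exactly the failure mode discussed in Section 9.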

8. Some Illustrative Examples

To illustrate the qualitative behavior of the CC-ODE flow, we simulated the CC-ODE for some specific channels. This was accomplished by applying the MWU rule described in (17) as an integration scheme to discretize the flow. We considered two channels and, for each channel, selected a different starting point to initialize the dynamics. For each of these configurations, we computed the (unique) optimal input distribution by running the BAA for 10,000 iterations. In particular, we considered the DMC with transition matrix
$$P^{(1)} = \begin{pmatrix} 0.70 & 0.20 & 0.10 \\ 0.10 & 0.80 & 0.10 \\ 0.25 & 0.25 & 0.50 \end{pmatrix},$$
which admits the unique optimal input distribution $(0.3645, 0.4169, 0.2186)$, and we considered the corresponding CC-ODE flow initialized at $y = (0.333, 0.333, 0.333)$. Similarly, we considered the noiseless and symmetric DMC with transition matrix
$$P^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
for which the optimal input distribution is approximately $(0.333, 0.333, 0.333)$, and we considered the corresponding flow initialized at $(0.700, 0.200, 0.100)$. To compare the CC-ODE vector flow with the BAA, we applied the MWU rule given by (17) with $\tau = 1$ and with $\tau = 0.01$, since the choice $\tau = 1$ yields exactly the Blahut–Arimoto map, whereas smaller values of $\tau$ lead to a better approximation of the CC-ODE flow. Figure 3 shows how the components of the input distribution vary with $t$. Note that, for both channels, the dynamics drive the system towards the optimal input distribution of the corresponding channel. Moreover, for the noiseless symmetric channel, note in Figure 3b that the BAA converges in one step. Figure 4 illustrates how the mutual information between the input and output variables evolves. Also in this case, the figure shows that the CC-ODE can be used to attain the capacity, similarly to what happens for the BAA. Clearly, for the noiseless symmetric channel, the capacity is attained via the BAA in just one step; see Figure 4b. Finally, in Figure 5, we depict some graphs plotted in semilog scale to provide some estimates of the speed of convergence of the CC-ODE. For every channel considered and for every integer value of $t$, we have marked with diamonds in Figure 5 the Euclidean distance between the input distribution obtained via the CC-ODE at time $t$ and the corresponding input distribution computed with the BAA. As shown in the figure, this quantity exhibits exponential decay after just a few iterations of the BAA. Moreover, we report with a solid red line the Euclidean distance between the limit optimal input distribution and the points on the orbit produced with $\tau = 0.01$. The channels considered have a non-singular transition matrix, and their optimal input distributions do not have null entries. Consequently, Theorem 5 is applicable to both channels.
As displayed in Figure 5, the solid red line decays roughly like the dotted grey line, which is the graph of a function of the form $C \exp(\lambda_{\max} t)$, with $\lambda_{\max}$ being the maximum eigenvalue discussed in Theorem 5, which equals $-1$ for the noiseless symmetric channel and $-0.1778$ for the other. For the noiseless symmetric channel, we also report in Figure 5b the theoretical asymptotic behavior described in Theorem 7 for the error decay. It is possible to see in Figure 5b that the computed error decay deviates from the expected behavior for $t \gtrsim 32$. However, the plot also shows that the error stabilizes when it reaches a value close to $10^{-14}$. Considering that the simulations were carried out using double precision, we interpret this outcome as a consequence of approaching the machine precision.
The interested reader may also see Ref. [29], where other integration schemes were tested in conjunction with Algorithm 1.
Algorithm 1 Discretizing the CC-ODE to compute the capacity [29].
Input:  τ , stepsize; Niter, maximum number of iterations; ε , tolerance; y , initialization point.
Output:  C ^ , estimated channel capacity; z ^ , estimated optimal input distribution; k, number of integration steps performed.
 1: $k \leftarrow 0$
 2: $z \leftarrow y$ ▹ Initializing $z$.
 3: $\mathrm{err} \leftarrow +\infty$
 4: while $k < \mathrm{Niter}$ and $\mathrm{err} \ge \varepsilon$ do
 5:   $z_{\mathrm{new}} \leftarrow \mathrm{SolverStep}(z, \tau)$
 6:   $\mathrm{err} \leftarrow \| z - z_{\mathrm{new}} \|_1$ ▹ Note: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$ is the $\ell^1$-norm of $x$.
 7:   $z \leftarrow z_{\mathrm{new}}$ ▹ Updating $z$.
 8:   $k \leftarrow k + 1$
 9: end while
10: $\hat{z} \leftarrow z$
11: $\hat{C} \leftarrow I(\hat{z})$
12: return $\hat{C}$, $\hat{z}$, $k$
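Algorithm 1 translates almost verbatim into Python. In the sketch below, SolverStep is the MWU rule, here assumed to have the standard exponential form $z_i \leftarrow z_i \exp(\tau\, \partial_i I(z)) / \text{normalizer}$, which for $\tau = 1$ reduces to the Blahut–Arimoto map up to a constant factor that cancels in the normalization; the iteration caps are arbitrary:

```python
import math

def grad_I(z, P):
    """Partial derivatives d_k I(z), closed form from Proposition 1.(a)."""
    n, m = len(P), len(P[0])
    q = [sum(z[i] * P[i][j] for i in range(n)) for j in range(m)]
    g = []
    for k in range(n):
        c_k = sum(P[k][j] * math.log(P[k][j]) for j in range(m) if P[k][j] > 0)
        g.append(c_k - 1 - sum(P[k][j] * math.log(q[j]) for j in range(m)))
    return g

def solver_step(z, tau, P):
    """One MWU step; tau = 1 recovers the Blahut-Arimoto map."""
    g = grad_I(z, P)
    w = [zi * math.exp(tau * gi) for zi, gi in zip(z, g)]
    s = sum(w)
    return [wi / s for wi in w]

def algorithm1(tau, n_iter, eps, y, P):
    """Direct transcription of Algorithm 1."""
    k, z, err = 0, list(y), float("inf")
    while k < n_iter and err >= eps:
        z_new = solver_step(z, tau, P)
        err = sum(abs(a - b) for a, b in zip(z, z_new))  # l1 distance
        z, k = z_new, k + 1
    m = len(P[0])
    q = [sum(z[i] * P[i][j] for i in range(len(z))) for j in range(m)]
    C_hat = sum(z[i] * P[i][j] * math.log(P[i][j] / q[j])
                for i in range(len(z)) for j in range(m) if P[i][j] > 0)
    return C_hat, z, k

P1 = [[0.70, 0.20, 0.10],
      [0.10, 0.80, 0.10],
      [0.25, 0.25, 0.50]]
y = [1/3, 1/3, 1/3]
C_baa, z_baa, _ = algorithm1(1.0, 10_000, 1e-12, y, P1)
C_ode, z_ode, _ = algorithm1(0.01, 200_000, 1e-12, y, P1)
# Both discretizations reach the same optimal input distribution.
assert all(abs(a - b) < 1e-3 for a, b in zip(z_baa, z_ode))
assert abs(C_baa - C_ode) < 1e-4
```

As in Section 8, $\tau = 1$ traces the BAA orbit while $\tau = 0.01$ approximates the continuous flow; both stop at (numerically) the same optimal input distribution.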

9. Discussion

We report in this section some additional remarks related to the results treated so far:
  • In formulating Lemma 1, we tried to require only the essential hypotheses used in its proof. Consequently, even though we apply Lemma 1 only in Proposition 2, it is also applicable in other, more general settings.
  • In the proof of Theorem 1, note that if $\mathcal{Y}_0 = \emptyset$, then $\Delta_n \subseteq \Omega$, and the thesis is just a mere application of classical results; see, e.g., Refs. [16,31]. By contrast, the proof deviates from a standard setting in case $\mathcal{Y}_0 \neq \emptyset$, which implies $\Delta_n \not\subseteq \Omega$.
  • As far as the proposed circuits are concerned, note that $z_k\, \partial_k I(z)$ could be produced by considering a module that outputs $\partial_k I(z)$ and whose output is then multiplied by $z_k$. However, note that $z \mapsto z_k\, \partial_k I(z)$ is a bounded function of the input by Proposition 4, which in general is not true for $\partial_k I(z)$.
  • We also remark on the following feature of the circuit presented in Figure 1. Assume that, for whatever reason, there are $u$ input symbols $x_{i_1}, x_{i_2}, \dots, x_{i_u}$ that become unavailable, in the sense that any input distribution is now constrained to assign null probability to $x_{i_1}, x_{i_2}, \dots, x_{i_u}$. This constraint on the channel $\mathcal{C} = (\mathcal{X}, \mathcal{Y}, P)$ corresponds to replacing it with $\mathcal{C}' = (\mathcal{X}', \mathcal{Y}, P')$, where $\mathcal{X}' = \mathcal{X} \setminus \{x_{i_k} : k \in [u]\}$ and $P'$ is obtained from $P$ by removing the rows $i_1, i_2, \dots, i_u$. In case every symbol of the output alphabet $\mathcal{Y}$ can still be obtained with positive probability, the capacity of $\mathcal{C}'$ may be computed without altering the circuit design displayed for $\mathcal{C}$, but by simply modifying the initialization rules on $z$. Specifically, by Proposition 2.(d), it is sufficient to initialize $z(0) \in \Delta_n$ so that $z_i(0) = 0$ if and only if $i \in \{i_1, i_2, \dots, i_u\}$, and the dynamics evolve $z$ so as to maximize $I(z)$ under this constraint.
  • Note that noise could negatively affect the capacity computation, despite the normalization module of Figure 2. Indeed, suppose that a perturbation causes $z$ to change its support at time $t = t_1$, and that for $t > t_1$, no additional perturbations affect the dynamics. Then it is possible that $z$ converges towards an input distribution that is still optimal, but for the wrong channel, as in the case discussed in the previous remark. To check whether this has occurred, it may be worth considering some “perturb and restart” strategies. For instance, when the capacity of $\mathcal{C}$ is sought in the presence of noise, assume that a trajectory initialized at some $z(0) \in \operatorname{int}(\Delta_n)$ converges to some $z^* \in \Delta_n$. We may then restart the dynamics at some $z \in \operatorname{int}(\Delta_n)$ obtained by slightly perturbing $z^*$, and then check whether $I(z)$ converges once again to $I(z^*)$, or whether $z$ converges to $z^*$ in case we know in advance that there exists a unique optimal input distribution.
  • By Proposition 4 and Peano's theorem, the ODE (7) extended by continuity admits a solution for any initialization $z(0) \in \Delta_n$, regardless of whether $z(0)$ lies in $\Omega$ or not. Indeed, a solution for (7) can always be found by considering the “subchannel” of $\mathcal{C}$ induced by the support of $z(0)$, similarly to the approach discussed before in Proposition 2.(d). However, in case $z(0) \notin \Omega$, note that the Picard–Lindelöf theorem cannot be applied, and we did not manage to prove that such a solution is unique. Despite this, for noiseless symmetric channels, note that the general integral (31) can be extended by continuity also for $z(0) \in \Delta_n$, and arguing by induction it is not hard to see that, for these trivial channels, solutions of the extended ODE are unique.

10. Conclusions

We performed a theoretical analysis of the flow of the CC-ODE, which governs a continuous-time dynamical system enabling the computation of the capacity, as well as of an optimal input distribution, of a DMC. We showed that the proposed dynamical system can be regarded as a continuous-time version of the BAA and that, under some technical conditions, the flow's rate of convergence to an optimal input distribution is exponential, with constants that correspond to those arising from iterating the Blahut–Arimoto map. We described possible implementations of the CC-ODE flow in a circuit that may compute analogically the capacity of a DMC, and we discussed how the circuit may still be useful in case the channel changes due to the unavailability of some input symbols.
Possible future work stemming from this paper includes the exploration of continuous-time dynamics to maximize mutual information under some additional constraints, as in [6], or for quantum channels [41]. Moreover, an interesting aspect of Theorem 3 is that it shows that the CC-ODE flow is related to some accelerated versions of the BAA [8] that correspond to an application of (17) with $f = I$ but where the stepsize is adjusted at every iteration. This link suggests investigating strategies to speed up analog computation by accelerating the CC-ODE flow. Lastly, a theoretical question that we have not addressed is whether solutions of the ODE obtained by extending (7) by continuity on $\Delta_n$ are unique, even though the hypotheses of the Picard–Lindelöf theorem are not met.
As mentioned by M. T. Chu in [42], continuous methods may help understand the corresponding discrete methods, and we hope that this paper may enrich our collective knowledge of the BAA, which is still a subject of active research.

Author Contributions

Conceptualization, G.B. and M.P.; methodology, G.B.; formal analysis, G.B.; investigation, G.B.; resources, M.P.; writing—original draft preparation, G.B.; writing—review and editing, G.B. and M.P.; visualization, G.B.; supervision, M.P.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the support and the funding received from Politecnico di Torino and Università Ca’ Foscari di Venezia.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors thank Giacomo Chiarot and Antonio Emanuele Cinà for the support given during the preliminary stage of this work, and the anonymous reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. ODEs on Compact Convex Sets

Let $C \subseteq \mathbb{R}^n$ be a closed convex set. Following Ref. [16], we recall that, for every $z \in C$, the tangent cone of $C$ at $z$ is the closed convex cone
$$T_C(z) = \operatorname{cl}\{\, \alpha (x - z) \mid x \in C,\ \alpha \in \mathbb{R}_{+} \,\},$$
where cl ( · ) denotes the closure in the usual Euclidean topology of R n . We report a technical result that deals with the solutions of ODEs over C in case it is also compact.
Theorem A1
(ODEs on compact convex sets [16] (Theorem 4.A.5.(i))). Let $C \subseteq \mathbb{R}^n$ be a compact convex set, and let $v : C \to \mathbb{R}^n$ be Lipschitz continuous. Suppose that $v(\hat{x}) \in T_C(\hat{x})$ for all $\hat{x} \in C$. Then, for each $\xi \in C$, there exists a unique $x : [0, +\infty) \to C$ satisfying $\dot{x} = v(x)$ and $x(0) = \xi$.

Appendix B. Support Invariance and Lyapunov Function

The following proposition reports some useful properties that are known for f admitting a Lipschitz continuous gradient on Δ n . For completeness, we provide the details that extend the known proof to our more general setting.
Proposition A1.
Let $f : \mathbb{R}^n \to \mathbb{R}$ admit a locally Lipschitz continuous gradient on some open $\Omega \subseteq \mathbb{R}^n$. Let $a, b \in \mathbb{R}$, with $a < b$, and let the function $z : [a, b] \to \Omega$ satisfy $z(t_0) \in \Delta_n$ for some $t_0 \in [a, b]$ and
$$\dot{z} = [\operatorname{diag}(z) - z z^{\top}] \nabla f(z), \qquad a \le t \le b.$$
(i)
We have $z(t) \in \Delta_n$ for $a \le t \le b$;
(ii)
The support of $z(t)$ is invariant for $a \le t \le b$;
(iii)
Either $z$ is a constant function or $f \circ z$ is strictly increasing.
Proof. 
(i), (ii): It is sufficient to show that the function $\sigma(t) = \sum_{i=1}^{n} z_i(t)$ satisfies $\sigma(t) = 1$ for every $a \le t \le b$, and that for every $i \in [n]$ either $z_i \equiv 0$ or $z_i(t) > 0$ for every $a \le t \le b$. Consider first the function
$$v(t, s) = (1 - s) \sum_{i=1}^{n} z_i(t)\, \partial_i f(z(t)),$$
which is defined for $(t, s) \in [a, b] \times \mathbb{R}$ and Lipschitz continuous in $s$ uniformly with respect to $t$; see, e.g., Ref. [34]. The ODE $\dot{s} = v(t, s)$ is solved by $\sigma$ and admits the constant solution $s \equiv 1$. Since by hypothesis $\sigma(t_0) = 1$, it follows that $\sigma(t) = 1$ for every $a \le t \le b$ by the Picard–Lindelöf theorem. Similarly, for every $i \in [n]$, consider the function
$$u_i(t, x) = x \left[ \partial_i f(z(t)) - \sum_{k=1}^{n} z_k(t)\, \partial_k f(z(t)) \right],$$
which is defined for $(t, x) \in [a, b] \times \mathbb{R}$ and Lipschitz continuous in $x$ uniformly with respect to $t$. The ODE $\dot{x} = u_i(t, x)$ is solved by $z_i(t)$ and admits the constant solution $x \equiv 0$. Consequently, since by hypothesis $z(t_0) \ge 0$, by the Picard–Lindelöf theorem either $z_i \equiv 0$ or $z_i(t) > 0$ for every $a \le t \le b$.
(iii): Set $\mu(x) = \sum_{i=1}^{n} x_i\, \partial_i f(x)$ for every $x \in \Omega$. Note that
$$\frac{d}{dt}(f \circ z) = \sum_{i=1}^{n} \partial_i f(z)\, \dot{z}_i = \sum_{i=1}^{n} z_i \left[ \partial_i f(z) - \mu(z) \right]^2,$$
and so
  • The function $f \circ z$ has non-negative derivative, and is thereby non-decreasing.
  • If for some $t_1$ the equality $\frac{d}{dt}(f \circ z)(t_1) = 0$ holds, then $\partial_i f(z(t_1)) - \mu(z(t_1)) = 0$ for every $i \in \operatorname{supp}(z(t_1))$, and it is easy to check that in this case $\dot{z}(t_1) = 0$; thus, $z \equiv z(t_1)$ by the Picard–Lindelöf theorem. □

Appendix C. An Example

Set $z^* = (1/2, 1/2, 0)$ and choose some $y \in \operatorname{int}(\Delta_3)$. Consider first the concave function $f_0 \in C^{\infty}(\mathbb{R}^3)$ defined by $f_0(z) = -\|z - z^*\|^2$. It is immediate to compute the Shahshahani gradient system associated with $f_0$:
$$\dot{z} = -2 [\operatorname{diag}(z) - z z^{\top}] (z - z^*).$$
Let $z_0 : [0, +\infty) \to \Delta_3$ solve (A1) and suppose $z_0(0) = y$. Then $z_0 \to z^*$ as $t \to +\infty$, and $z^*$ is a stationary point for (A1), as well as the global maximum of $f_0$ over $\Delta_3$. Consider now the concave function $f_1$ defined on $\mathbb{R}^3$ by $f_1(z) = -\|z - z^*\|$, and the associated Shahshahani gradient system:
$$\dot{z} = -|f_1(z)|^{-1} [\operatorname{diag}(z) - z z^{\top}] (z - z^*).$$
Let $z_1 : [0, +\infty) \to \Delta_3$ solve (A2) and suppose $z_1(0) = y$. By uniqueness of solutions, $z_1(t) = (z_0 \circ \tau)(t)$, with $\tau(t) = \int_0^t \left[ 2 |f_1(z_0(\tau(r)))| \right]^{-1} dr$. Consequently, $z^* = \lim_{t \to +\infty} z_1(t)$. However, $\nabla f_1$ is undefined at $z^*$, and so (A2) is undefined at $z = z^*$; thus, $z^*$ is not a stationary point for (A2).

Appendix D. Proof of Claim 1

The proof of Claim 1 here given adapts an argument used in [16] (Theorem 7.2.4), where a similar result is proved for a more stringent setting, namely, for f strictly convex and smooth on Δ n .
Proof. 
Assume that $z(t)$ is not a constant function, since otherwise the proof is trivial. Call $\Lambda$ the set of limit points of $z(t)$ as $t \to +\infty$; then note, by compactness of $K$, that $\Lambda \subseteq K$, and fix some $z^* \in \Lambda$. On $D_{z^*} = \{ x \in \mathbb{R}^n \mid \operatorname{supp}(x) \supseteq \operatorname{supp}(z^*) \}$, consider the function
$$h(x) = \sum_{i \in \operatorname{supp}(z^*)} z_i^* \ln \frac{z_i^*}{x_i}.$$
Additional properties are known for $h(x)$ if we constrain $x$ to $D_{z^*} \cap \Delta_n$, since in this case $h(x) = D(z^* \,\|\, x)$, where the right-hand side denotes the relative entropy between $z^*$ and $x$; see, e.g., Ref. [3]. Specifically, we are interested in the following:
  • $h(x) \ge 0$, with equality only for $x = z^*$.
  • $x \to z^*$ if and only if $h(x) \to 0$.
Note also that $z(t) \in D_{z^*} \cap \Delta_n$ for every $t$ by Proposition A1. Therefore,
  • We have $h \circ z \ge 0$, and by definition of $z^*$,
$$\liminf_{t \to +\infty} (h \circ z) = 0; \qquad \text{(A4)}$$
  • The proof is concluded if we also show that
$$\lim_{t \to +\infty} (h \circ z) = 0. \qquad \text{(A5)}$$
By Proposition A1, the function $f \circ z$ is strictly increasing; hence, by continuity of $f$,
$$f(z^*) = \sup_{t \ge 0} (f \circ z)(t) = \lim_{t \to +\infty} (f \circ z)(t).$$
Moreover, we can now prove that
$$\frac{d}{dt}(h \circ z) \le 0. \qquad \text{(A7)}$$
In fact, using that $\sum_i z_i^* = 1$, we obtain
$$\frac{d}{dt}(h \circ z) = -\sum_{i \in \operatorname{supp}(z^*)} z_i^* \frac{\dot{z}_i}{z_i} = -\sum_{i=1}^{n} z_i^* \left[ \partial_i f(z) - \sum_{k=1}^{n} z_k\, \partial_k f(z) \right] = \sum_{k=1}^{n} z_k\, \partial_k f(z) - \sum_{i=1}^{n} z_i^*\, \partial_i f(z) = \sum_{i=1}^{n} (z_i - z_i^*)\, \partial_i f(z) = -\langle \nabla f(z), z^* - z \rangle \le f(z) - f(z^*).$$
Then (A7) follows, since $f(z) \le f(z^*)$ by the monotonicity of $f \circ z$, using that
$$f(z^*) \le f(z) + \langle \nabla f(z), z^* - z \rangle, \qquad \text{(A8)}$$
which is a consequence of the concavity of $f$ on $K$; see, e.g., Ref. [37]. By (A7), the non-negative function $h \circ z$ is non-increasing, and since we know that (A4) holds, this yields (A5). □

Appendix E. Proof of Claim 2

Proof. 
The Lagrangian associated with (2) is the function $L : \Delta_n \times \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ given by
$$L(x, \mu_0, \mu) = f(x) + \mu_0 \left( \langle \mathbf{1}_n, x \rangle - 1 \right) + \langle \mu, x \rangle,$$
and the KKT conditions for $x^*$ amount to the following system (see, e.g., Ref. [43]):
$$\nabla_x L(x^*, \mu_0, \mu) = \nabla f(x^*) + \mu_0 \mathbf{1}_n + \mu = 0, \qquad \text{(A9)}$$
$$\mu_i x_i^* = 0, \quad i \in [n], \qquad \text{(A10)}$$
$$\mu \ge 0. \qquad \text{(A11)}$$
Note that (A9) is equivalent to $\mu = -\nabla f(x^*) - \mu_0 \mathbf{1}_n$, which can be used to eliminate $\mu$ in (A10) and (A11), yielding
$$\left[ \partial_i f(x^*) + \mu_0 \right] x_i^* = 0, \quad i \in [n], \qquad \nabla f(x^*) + \mu_0 \mathbf{1}_n \le 0,$$
i.e.,
$$\partial_i f(x^*) = -\mu_0, \quad i \in \operatorname{supp}(x^*), \qquad \partial_i f(x^*) \le -\mu_0, \quad i \in [n].$$
The thesis then follows by considering $\lambda = -\mu_0$. □

Appendix F. Proof of Proposition 1

Proof. 
(a): Let $k \in [n]$ and observe that
$$\partial_k I(z) = c_k - \sum_{j=1}^{m} \partial_k (q_j \ln q_j)(z) = c_k - \sum_{j=1}^{m} \{1 + \ln[q_j(z)]\}\, \partial_k q_j(z) = c_k - \sum_{j=1}^{m} p(j|k) - \sum_{j=1}^{m} p(j|k) \ln q_j(z) = c_k - 1 - \sum_{j=1}^{m} p(j|k) \ln q_j(z).$$
(b): Apply (a) to obtain
$$\sum_{k=1}^{n} z_k\, \partial_k I(z) = \sum_{k=1}^{n} z_k \left[ c_k - 1 - \sum_{j=1}^{m} p(j|k) \ln q_j \right] = \sum_{k=1}^{n} c_k z_k - \sum_{k=1}^{n} z_k - \sum_{j=1}^{m} q_j \ln q_j = I(z) - \sum_{k=1}^{n} z_k.$$
(c): By (a),
$$\partial^2_{i,k} I(z) = -\sum_{j=1}^{m} p(j|k)\, \frac{\partial_i q_j(z)}{q_j(z)} = -\sum_{j=1}^{m} \frac{p(j|i)\, p(j|k)}{q_j(z)}.$$
 □
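Proposition 1 can also be double-checked numerically. The sketch below compares the closed form of $\partial_k I$ with a central finite difference of $I(z) = \sum_k c_k z_k - \sum_j q_j \ln q_j$ at a point of $\Omega$ off the simplex, and verifies identity (b) there; the extension of $I$ to $\Omega$ used here is the one implied by the formulas above, with $c_k = \sum_j p(j|k) \ln p(j|k)$:

```python
import math

P = [[0.70, 0.20, 0.10],
     [0.10, 0.80, 0.10],
     [0.25, 0.25, 0.50]]
n, m = 3, 3
c = [sum(P[k][j] * math.log(P[k][j]) for j in range(m) if P[k][j] > 0)
     for k in range(n)]

def I(z):
    """Extension of the mutual information to Omega = (0, +inf)^n."""
    q = [sum(z[i] * P[i][j] for i in range(n)) for j in range(m)]
    return sum(c[k] * z[k] for k in range(n)) - sum(qj * math.log(qj) for qj in q)

def dI(z, k):
    """Closed form of d_k I from Proposition 1.(a)."""
    q = [sum(z[i] * P[i][j] for i in range(n)) for j in range(m)]
    return c[k] - 1 - sum(P[k][j] * math.log(q[j]) for j in range(m))

z = [0.2, 0.5, 0.4]           # a point of Omega, not on the simplex
h = 1e-6
for k in range(n):
    zp = list(z); zp[k] += h
    zm = list(z); zm[k] -= h
    fd = (I(zp) - I(zm)) / (2 * h)   # central finite difference
    assert abs(fd - dI(z, k)) < 1e-8

# Proposition 1.(b): sum_k z_k d_k I(z) = I(z) - sum_k z_k, valid on all of Omega.
assert abs(sum(z[k] * dI(z, k) for k in range(n)) - (I(z) - sum(z))) < 1e-10
```

The identity in (b) holds exactly off the simplex as well, which is precisely what the proof above exploits.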

References

  1. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  2. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  3. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
  4. Csiszár, I.; Tusnády, G. Information geometry and alternating minimization procedures. Stat. Decis. 1984, 1, 205–237. [Google Scholar]
  5. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20. [Google Scholar] [CrossRef]
  6. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473. [Google Scholar] [CrossRef]
  7. Yu, Y. Squeezing the Arimoto-Blahut algorithm for faster convergence. IEEE Trans. Inf. Theory 2010, 56, 3149–3157. [Google Scholar] [CrossRef]
  8. Matz, G.; Duhamel, P. Information geometric formulation and interpretation of accelerated Blahut-Arimoto-type algorithms. In 2004 IEEE Information Theory Workshop; IEEE Press: Piscataway, NJ, USA, 2004; pp. 66–70. [Google Scholar] [CrossRef]
  9. Nakagawa, K.; Takei, Y.; Hara, S.I.; Watabe, K. Analysis of the convergence speed of the Arimoto-Blahut algorithm by the second-order recurrence formula. IEEE Trans. Inf. Theory 2021, 67, 6810–6831. [Google Scholar] [CrossRef]
  10. Boche, H.; Schaefer, R.F.; Poor, H.V. Algorithmic computability and approximability of capacity-achieving input distributions. IEEE Trans. Inf. Theory 2023, 69, 5449–5462. [Google Scholar] [CrossRef]
  11. Helmke, U.; Moore, J.B. Optimization and Dynamical Systems; Springer: London, UK, 1994. [Google Scholar]
  12. Brockett, R.W. Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems. Linear Algebra Its Appl. 1991, 146, 79–91. [Google Scholar] [CrossRef]
  13. Arora, S.; Hazan, E.; Kale, S. The multiplicative weights update method. Theory Comput. 2012, 8, 121–164. [Google Scholar] [CrossRef]
  14. Naja, Z.; Alberge, F.; Duhamel, P. Geometrical interpretation and improvements of the Blahut-Arimoto’s algorithm. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 2505–2508. [Google Scholar] [CrossRef]
  15. Weibull, J. Evolutionary Game Theory; MIT Press: Cambridge, MA, USA, 1995. [Google Scholar]
  16. Sandholm, W.H. Population Games and Evolutionary Dynamics; MIT Press: Cambridge, MA, USA, 2010. [Google Scholar]
  17. Khalil, H.K. Nonlinear Systems, 3rd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2002. [Google Scholar]
  18. Roman, S. Coding and Information Theory; Graduate texts in mathematics; Springer: New York, NY, USA, 1992; Volume 134. [Google Scholar]
  19. MacLennan, B.J. Analog Computation. In Encyclopedia of Complexity and Systems Science; Meyers, R.A., Ed.; Springer: New York, NY, USA, 2009; pp. 271–294. [Google Scholar] [CrossRef]
  20. Taherkhani, A.; Belatreche, A.; Li, Y.; Cosma, G.; Maguire, L.P.; McGinnity, T.M. A review of learning in biologically plausible spiking neural networks. Neural Netw. 2020, 122, 253–272. [Google Scholar] [CrossRef]
  21. Wang, Q.; Meng, C.; Wang, C. Analog continuous-time filter designing for Morlet wavelet transform using constrained L2-norm approximation. IEEE Access 2020, 8, 121955–121968. [Google Scholar] [CrossRef]
  22. Schuman, C.D.; Kulkarni, S.R.; Parsa, M.; Mitchell, J.P.; Date, P.; Kay, B. Opportunities for neuromorphic computing algorithms and applications. Nat. Comput. Sci. 2022, 2, 10–19. [Google Scholar] [CrossRef] [PubMed]
  23. Edwards, J.; Parker, L.; Cardwell, S.G.; Chance, F.S.; Koziol, S. Neural-inspired dendritic multiplication using a reconfigurable analog integrated circuit. In Proceedings of the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, 19–22 May 2024. [Google Scholar] [CrossRef]
  24. Hasler, J. Energy-efficient programable analog computing. IEEE Solid-State Circuits Mag. 2024, 16, 32–40. [Google Scholar] [CrossRef]
  25. Hasler, J.; Hao, C. Programmable analog system benchmarks leading to efficient analog computation synthesis. ACM Trans. Reconfig. Technol. Syst. 2024, 17, 1–25. [Google Scholar] [CrossRef]
  26. Parker, L.; Cardwell, S.G.; Chance, F.S.; Koziol, S. Bio-inspired active silicon dendrite for direction selectivity. In Proceedings of the 2024 International Conference on Neuromorphic Systems (ICONS), Arlington, VA, USA, 30 July–2 August 2024; pp. 343–349. [Google Scholar] [CrossRef]
  27. Ulmann, B. Beyond zeros and ones—analog computing in the twenty-first century. Int. J. Parallel Emergent Distrib. Syst. 2024, 39, 139–151. [Google Scholar] [CrossRef]
  28. Torsello, A.; Pelillo, M. Continuous-time relaxation labeling processes. Pattern Recognit. 2000, 33, 1897–1908. [Google Scholar] [CrossRef]
  29. Beretta, G.; Chiarot, G.; Cinà, A.E.; Pelillo, M. Computing the capacity of discrete channels using vector flows. In Dynamics of Information Systems: 7th International Conference, DIS 2024, Kalamata, Greece, 2–7 June 2024, Revised Selected Papers; Moosaei, H., Kotsireas, I., Pardalos, P.M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 14661, pp. 119–128. [Google Scholar] [CrossRef]
  30. Faybusovich, L. Dynamical systems which solve optimization problems with linear constraints. IMA J. Math. Control Inf. 1991, 8, 135–149. [Google Scholar] [CrossRef]
  31. Alvarez, F.; Bolte, J.; Brahic, O. Hessian Riemannian gradient flows in convex programming. SIAM J. Control Optim. 2004, 43, 477–501. [Google Scholar] [CrossRef]
  32. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
  33. Birkhoff, G.; Rota, G.C. Ordinary Differential Equations; Wiley: Hoboken, NJ, USA, 1989. [Google Scholar]
  34. Teschl, G. Ordinary Differential Equations and Dynamical Systems; American Mathematical Society: Providence, RI, USA, 2024. [Google Scholar]
  35. Hirsch, M.W.; Smale, S.; Devaney, R.L. Differential Equations, Dynamical Systems, and an Introduction to Chaos, 3rd ed.; Academic Press: Waltham, MA, USA, 2013. [Google Scholar]
  36. Mertikopoulos, P.; Sandholm, W.H. Riemannian game dynamics. J. Econ. Theory 2018, 177, 315–364. [Google Scholar] [CrossRef]
  37. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970. [Google Scholar]
  38. Boyd, S.P.; El Ghaoui, L.; Feron, E.; Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory; SIAM Studies in Applied Mathematics; SIAM: Philadelphia, PA, USA, 1994; Volume 15. [Google Scholar] [CrossRef]
  39. Hopfield, J.J. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 1984, 81, 3088–3092. [Google Scholar] [CrossRef]
  40. Cichocki, A.; Unbehauen, R. Neural Networks for Optimization and Signal Processing; Wiley: Hoboken, NJ, USA, 1993. [Google Scholar]
  41. He, K.; Saunderson, J.; Fawzi, H. A Bregman proximal perspective on classical and quantum Blahut-Arimoto algorithms. IEEE Trans. Inf. Theory 2024, 70, 5710–5730. [Google Scholar] [CrossRef]
  42. Chu, M.T. On the continuous realization of iterative processes. SIAM Rev. 1988, 30, 375–387. [Google Scholar] [CrossRef]
  43. Luenberger, D.G.; Ye, Y. Linear and Nonlinear Programming; Springer: Cham, Switzerland, 2016. [Google Scholar]
Figure 1. Ideal circuit.
Figure 2. Circuit with normalizing module.
Figure 3. Evolution of CC-ODE and BAA orbits. (a) Channel with transition matrix P(1), systems initialized in y = (0.333, 0.333, 0.333). (b) Channel with transition matrix P(2), systems initialized in y = (0.700, 0.200, 0.100).
Figure 4. Evolution of I(z) computed along CC-ODE and BAA orbits. (a) Channel with transition matrix P(1), systems initialized in y = (0.333, 0.333, 0.333). (b) Channel with transition matrix P(2), systems initialized in y = (0.700, 0.200, 0.100).
Figure 5. Speed of convergence of the CC-ODE. (a) Channel with transition matrix P(1), systems initialized in y = (0.333, 0.333, 0.333). (b) Channel with transition matrix P(2), systems initialized in y = (0.700, 0.200, 0.100). Diamonds mark the BAA iterates for comparison, and the solid red line reports the distance of z from the limit point of the dynamics. The dotted grey lines provide estimates of the minimal limiting steepness of the red line, as a consequence of Theorem 5. The dashed black line in (b) gives the theoretical asymptotic behavior described in Theorem 7.
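The figures above compare orbits of the proposed CC-ODE against iterates of the classical Blahut–Arimoto algorithm (BAA). As a point of reference for the BAA side of the comparison, the following is a minimal sketch of the classical discrete-time iteration for the capacity of a discrete memoryless channel. The transition matrices P(1) and P(2) used in the experiments are not reproduced in this excerpt, so the example below uses an illustrative binary symmetric channel; the function name and parameters are our own choices, not taken from the paper.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-12, max_iter=10_000):
    """Classical Blahut-Arimoto iteration for the capacity of a DMC.

    W : (n_inputs, n_outputs) row-stochastic transition matrix P(y|x).
    Returns (capacity estimate in bits, optimizing input distribution).
    """
    n = W.shape[0]
    p = np.full(n, 1.0 / n)              # start from the uniform input law
    for _ in range(max_iter):
        q = p @ W                        # induced output distribution
        # D_x = KL(W_x || q), with the 0*log(0) = 0 convention
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(W > 0, W / q, 1.0)
        D = np.sum(W * np.log2(ratio), axis=1)
        # Arimoto's bounds:  log2(sum_x p_x 2^{D_x}) <= C <= max_x D_x
        lower, upper = np.log2(p @ np.exp2(D)), D.max()
        p = p * np.exp2(D)               # multiplicative update ...
        p /= p.sum()                     # ... followed by renormalization
        if upper - lower < tol:
            break
    return lower, p
```

For a binary symmetric channel with crossover probability 0.1, i.e., W = [[0.9, 0.1], [0.1, 0.9]], the iteration returns approximately 0.531 bits, in agreement with the closed-form capacity 1 − H(0.1) attained by the uniform input distribution.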