Article

Kolmogorov Capacity with Overlap †

by Anshuka Rangi and Massimo Franceschetti *
Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, Mail Code 0407, La Jolla, CA 92093-0407, USA
* Author to whom correspondence should be addressed.
† This article is a revised and expanded version of two papers: "Towards a Non-Stochastic Information Theory", presented at the 2019 IEEE International Symposium on Information Theory, Paris, France, 7–12 July 2019, and "Channel Coding Theorems in Non-Stochastic Information Theory", presented at the 2021 IEEE International Symposium on Information Theory, Melbourne, VIC, Australia, 12–20 July 2021.
Entropy 2025, 27(5), 472; https://doi.org/10.3390/e27050472
Submission received: 4 March 2025 / Revised: 9 April 2025 / Accepted: 10 April 2025 / Published: 27 April 2025
(This article belongs to the Collection Feature Papers in Information Theory)

Abstract: The notion of δ-mutual information between non-stochastic uncertain variables is introduced as a generalization of Nair's non-stochastic information functional. Several properties of this new quantity are illustrated and used in a communication setting to show that the largest δ-mutual information between received and transmitted codewords over ϵ-noise channels equals the (ϵ, δ)-capacity. This notion of capacity generalizes the Kolmogorov ϵ-capacity to packing sets of overlap at most δ and is a variation of a previous definition proposed by one of the authors. Results are then extended to more general noise models, including non-stochastic, memoryless, and stationary channels. The presented theory admits the possibility of decoding errors, as in classical information theory, while retaining the worst-case, non-stochastic character of Kolmogorov's approach.

1. Introduction

Shannon’s celebrated channel coding theorem states that the capacity is the supremum of the mutual information between the input and the output of the channel [1]. In this setting, the mutual information is intended as the amount of information obtained regarding the random variable at the input of the channel by observing the random variable at the output of the channel, and the capacity is the largest rate of communication that can be achieved with an arbitrarily small probability of error. In an effort to provide an analogous result for safety-critical control systems where occasional decoding errors can result in catastrophic failures, Nair introduced a non-stochastic mutual information functional and established that this equals the zero-error capacity [2], namely the largest rate of communication that can be achieved with zero probability of error. Nair’s approach is based on the calculus of non-stochastic uncertain variables (UVs), and his definition of mutual information in a non-stochastic setting is based on the quantization of the range of uncertainty of a UV induced by the knowledge of the other. While Shannon’s theorem leads to a single letter expression, Nair’s result is multi-letter, involving the non-stochastic information between codeword blocks of n symbols. The zero-error capacity can also be formulated as a graph-theoretic property, and the absence of a single-letter expression for general graphs is well known [3,4]. Extensions of Nair’s non-stochastic approach to characterize the zero-error capacity in the presence of feedback from the receiver to the transmitter using nonstochastic directed mutual information have been considered in [5].
Kolmogorov introduced the notion of ϵ-capacity in the context of functional spaces as the logarithm base two of the packing number of the space, namely the logarithm of the maximum number of balls of radius ϵ that can be placed in the space without overlap [6]. Determining this number is analogous to designing a channel codebook such that the distance between any two codewords is at least 2ϵ. In this way, any transmitted codeword that is subject to a perturbation of at most ϵ can be recovered at the receiver without error. It follows that the ϵ-capacity per transmitted symbol (viz. per signal dimension) corresponds to the zero-error capacity of an additive channel having arbitrary bounded noise of radius at most ϵ. Lim and Franceschetti extended this concept by introducing the (ϵ, δ)-capacity [7], defined as the logarithm base two of the largest number of balls of radius ϵ that can be placed in the space with an average codeword overlap of at most δ. In this setting, δ measures the amount of error that can be tolerated when designing a codebook in a non-stochastic setting, and the (ϵ, δ)-capacity per transmitted symbol corresponds to the largest rate of communication with error at most δ.
The first contribution of this paper is to consider a generalization of Nair's mutual information based on a quantization of the range of uncertainty of a UV given the knowledge of another, which reduces the uncertainty to at most δ, and to show that this new notion corresponds to the (ϵ, δ)-capacity. Our definition of (ϵ, δ)-capacity is a variation of the one in [7], as it is required to bound the overlap between any pair of balls, rather than the average overlap. For δ = 0, we recover Nair's result for the Kolmogorov ϵ-capacity or, equivalently, for the zero-error capacity of an additive, bounded noise channel. We then extend the results to more general channels where the noise can be different across codewords and is not necessarily contained within a ball of radius ϵ. Finally, we consider the class of non-stochastic, memoryless, stationary uncertain channels, where the noise experienced by a codeword of n symbols factorizes into n identical terms describing the noise experienced by each codeword symbol. This is the non-stochastic analog of a discrete memoryless channel (DMC), where the current output symbol depends only on the current input symbol, not on any of the previous input symbols, and where the noise distribution is constant across symbol transmissions. It differs from Kolmogorov's ϵ-noise channel, where the noise experienced by one symbol affects the noise experienced by other symbols (in Kolmogorov's setting, the noise occurs within a ball of radius ϵ; it follows that for any realization where the noise along one dimension, viz. symbol, is close to ϵ, the noise experienced by all other symbols lying in the remaining dimensions must be close to zero). Letting 1 − δₙ be the confidence of correct decoding after transmitting n symbols, we introduce several notions of capacity and establish coding theorems in terms of mutual information for all of them, including a generalization of the zero-error capacity that requires the error sequence {δₙ} to remain constant and a non-stochastic analog of Shannon's capacity that requires the error sequence to vanish as n → ∞.
Finally, since, as in Nair's case, all of our results are multi-letter, in the Supplementary Materials we provide some sufficient conditions for the factorization of the mutual information leading to a single-letter expression for the non-stochastic capacity of stationary, memoryless, uncertain channels, provide some examples in which these conditions are satisfied, and compute the corresponding capacity.
The rest of the paper is organized as follows: Section 2 introduces the mathematical framework of non-stochastic uncertain variables that are used throughout the paper. Section 3 introduces the concept of non-stochastic mutual information. Section 4 gives an operational definition of the capacity of a communication channel and relates it to the mutual information. Section 5 extends the results to more general channel models, and Section 6 concentrates on the special case of stationary, memoryless, uncertain channels. Section 7 draws conclusions and discusses future directions.

2. Uncertain Variables

We start by reviewing the mathematical framework used in [2] to describe UVs. A UV X is a mapping from a sample space Ω to a set X, i.e., for all ω ∈ Ω, we have x = X(ω) ∈ X, namely
X : Ω → X, ω ↦ X(ω) = x.    (1)
Given a UV X, the marginal range of X is
⟦X⟧ = { X(ω) : ω ∈ Ω }.    (2)
The joint range of the two UVs X and Y is
⟦X, Y⟧ = { (X(ω), Y(ω)) : ω ∈ Ω }.    (3)
Given a UV Y, the conditional range of X given Y = y is
⟦X|y⟧ = { X(ω) : Y(ω) = y, ω ∈ Ω },    (4)
and the conditional range of X given Y is
⟦X|Y⟧ = { ⟦X|y⟧ : y ∈ ⟦Y⟧ }.    (5)
Thus, ⟦X|Y⟧ denotes the uncertainty in X given the realization of Y, and ⟦X, Y⟧ represents the total joint uncertainty of X and Y, namely
⟦X, Y⟧ = ⋃_{y ∈ ⟦Y⟧} ⟦X|y⟧ × {y}.    (6)
Finally, two UVs X and Y are independent if for all x ∈ ⟦X⟧,
⟦Y|x⟧ = ⟦Y⟧,    (7)
which also implies that for all y ∈ ⟦Y⟧,
⟦X|y⟧ = ⟦X⟧.    (8)
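To make this calculus concrete, the following Python sketch (ours, not part of the paper) builds two toy UVs over a hypothetical four-point sample space and computes their marginal, joint, and conditional ranges; the sample space and the maps X and Y are illustrative assumptions.

# Minimal sketch of uncertain variables (UVs) over a finite sample space.
Omega = ["w1", "w2", "w3", "w4"]

# A UV is simply a map from the sample space to a set of values.
X = {"w1": 0, "w2": 0, "w3": 1, "w4": 2}
Y = {"w1": "a", "w2": "b", "w3": "b", "w4": "c"}

def marginal_range(U):
    """[[U]] = { U(w) : w in Omega }."""
    return {U[w] for w in Omega}

def joint_range(U, V):
    """[[U, V]] = { (U(w), V(w)) : w in Omega }."""
    return {(U[w], V[w]) for w in Omega}

def conditional_range(U, V, v):
    """[[U | v]] = { U(w) : V(w) = v, w in Omega }."""
    return {U[w] for w in Omega if V[w] == v}

def conditional_ranges(U, V):
    """The family of conditional ranges [[U|v]], one for each v in [[V]]."""
    return [conditional_range(U, V, v) for v in marginal_range(V)]

print(marginal_range(X))             # {0, 1, 2}
print(joint_range(X, Y))             # {(0, 'a'), (0, 'b'), (1, 'b'), (2, 'c')}
print(conditional_range(X, Y, "b"))  # {0, 1}
print(conditional_ranges(X, Y))      # list of the conditional ranges of X given Y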

3. δ -Mutual Information

3.1. Uncertainty Function

We now introduce a class of functions that are used to express the amount of uncertainty in determining one UV given another. In our setting, an uncertainty function associates a positive number with a given set, which expresses the “massiveness” or “size” of that set.
Definition 1. 
Given the set of non-negative real numbers ℝ₀⁺ and any set X, the map m_X : 2^X → ℝ₀⁺ is an uncertainty function if it is finite and strongly transitive:
We have m_X(∅) = 0, and for all S ⊆ X with S ≠ ∅,
0 < m_X(S) < ∞.    (9)
For all S₁, S₂ ⊆ X, we have
max{ m_X(S₁), m_X(S₂) } ≤ m_X(S₁ ∪ S₂).    (10)
In the case where X is measurable, an uncertainty function can easily be constructed using a measure. In the case where X is a bounded (not necessarily measurable) metric space and the input set S contains at least two points, an example of an uncertainty function is the diameter.
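As an illustration of the two constructions mentioned above, the short Python sketch below implements a counting-measure-based uncertainty function and a diameter-based one for finite subsets of the reals; the padding constant that keeps singletons strictly positive is our own assumption, mirroring the "+10" used later in Example 1.

def m_counting(S):
    """Counting-measure-based uncertainty: the number of points in S (0 for the empty set)."""
    return float(len(S))

def m_diameter(S, pad=1.0):
    """Diameter-based uncertainty; `pad` is an assumed constant that keeps nonempty
    singletons strictly positive, as required by (9)."""
    if not S:
        return 0.0
    return (max(S) - min(S)) + pad

# Both functions are zero only on the empty set and satisfy strong transitivity (10):
S1, S2 = {0.0, 1.0}, {1.0, 3.0}
for m in (m_counting, m_diameter):
    assert max(m(S1), m(S2)) <= m(S1 | S2)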

3.2. Association and Dissociation Between UVs

We now introduce notions of association and dissociation between UVs. In the following definitions, we let m_X(·) and m_Y(·) be uncertainty functions defined over the sets X and Y corresponding to the UVs X and Y. We use the notation A ≻ δ to indicate that for all a ∈ A, we have a > δ. Similarly, we use A ≼ δ to indicate that for all a ∈ A, we have a ≤ δ, where A ⊆ ℝ and δ ∈ ℝ. For A = ∅, we assume that A ≼ δ is always satisfied, while A ≻ δ is not. Whenever we consider i ≠ j, we also assume that y_i ≠ y_j and x_i ≠ x_j, where x_i, x_j ∈ ⟦X⟧ and y_i, y_j ∈ ⟦Y⟧.
Definition 2. 
The sets of association for UVs X and Y are
A(X; Y) = { m_X(⟦X|y₁⟧ ∩ ⟦X|y₂⟧) / m_X(⟦X⟧) : y₁, y₂ ∈ ⟦Y⟧ } \ {0},    (11)
A(Y; X) = { m_Y(⟦Y|x₁⟧ ∩ ⟦Y|x₂⟧) / m_Y(⟦Y⟧) : x₁, x₂ ∈ ⟦X⟧ } \ {0}.    (12)
The sets of association are used to describe the correlation between two uncertain variables. Since m X ( ) = 0 , the exclusion of the zero value in (11) and (12) occurs when there is no overlap between the two conditional ranges.
Definition 3. 
For any δ₁, δ₂ ∈ [0, 1), UVs X and Y are disassociated at levels (δ₁, δ₂) if the following inequalities hold:
A(X; Y) ≻ δ₁,    (13)
A(Y; X) ≻ δ₂,    (14)
and, in this case, we write ( X , Y ) d ( δ 1 , δ 2 ) .
Having UVs X and Y be disassociated at levels ( δ 1 , δ 2 ) indicates that at least two conditional ranges X | y 1 and X | y 2 have non-zero overlap and that, given any two conditional ranges, either they do not overlap or the uncertainty associated with their overlap is greater than a δ 1 fraction of the total uncertainty associated with X ; the same holds for conditional ranges Y | x 1 and Y | x 2 and for level δ 2 . The levels of disassociation can be viewed as lower bounds in the amount of residual uncertainty in each variable when the other is known. If X and Y are independent, then all the conditional ranges completely overlap, A ( X ; Y ) and A ( Y ; X ) contain only the element one, and the variables are maximally disassociated (see Figure 1a).
In this case, knowledge of Y does not reduce the uncertainty of X, and vice versa. On the other hand, when the uncertainty associated with any of the non-zero intersections of the conditional ranges decreases but remains positive, X and Y become less disassociated in the sense that knowledge of Y can reduce the residual uncertainty of X, and vice versa (see Figure 1b). When the intersection between every pair of conditional ranges becomes empty, the variables cease to be disassociated (see Figure 1c). Note that excluding the value of 1 in the definition of disassociation allows us to distinguish the case of disassociation from the case of full independence.
An analogous definition of association is given to provide upper bounds on the residual uncertainty of one uncertain variable when the other is known.
Definition 4. 
For any δ₁, δ₂ ∈ [0, 1], we say that UVs X and Y are associated at levels (δ₁, δ₂) if the following inequalities hold:
A(X; Y) ≼ δ₁,    (15)
A(Y; X) ≼ δ₂,    (16)
and in this case, we write ( X , Y ) a ( δ 1 , δ 2 ) .
Note that δ 1 , δ 2 = 1 is included in Definition 4 and not in Definition 3. This is because in Definition 3, we have a strict lower bound on the uncertainty in the association sets.
The following lemma provides the necessary and sufficient conditions for association to hold at given levels. These conditions are stated for all points in the marginal ranges Y and X . They show that in the case of association, one can also include in the definition the conditional ranges that have zero intersection. This is not the case for disassociation.
Lemma 1. 
For any δ₁, δ₂ ∈ [0, 1], (X, Y) a (δ₁, δ₂) if and only if for all y₁, y₂ ∈ ⟦Y⟧, we have
m_X(⟦X|y₁⟧ ∩ ⟦X|y₂⟧) / m_X(⟦X⟧) ≤ δ₁,    (17)
and for all x₁, x₂ ∈ ⟦X⟧, we have
m_Y(⟦Y|x₁⟧ ∩ ⟦Y|x₂⟧) / m_Y(⟦Y⟧) ≤ δ₂.    (18)
Proof. 
The proof is given in Appendix A. □
An immediate yet important consequence of our definitions is that both association and disassociation at given levels ( δ 1 , δ 2 ) cannot hold simultaneously. We also understand that, given any two UVs, one can always select δ 1 and δ 2 to be large enough such that they are associated at levels ( δ 1 , δ 2 ) . In contrast, as the smallest value in the sets A ( X ; Y ) and A ( Y ; X ) tends towards zero, the variables eventually cease to be disassociated. Finally, it is possible that two uncertain variables are neither associated nor disassociated at given levels ( δ 1 , δ 2 ) . Also, any two uncertain variables are associated at level ( 1 , 1 ) trivially by definition.
Example 1. 
Consider three individuals, a, b, and c, going for a walk along a path. Assume they take at most 15, 20, and 10 min to finish their walk, respectively. Assume a starts walking at time 5:00, b starts walking at 5:10, and c starts walking at 5:20. Figure 2 shows the possible time intervals for the walkers on the path. Let an uncertain variable W represent the set of walkers that are present on the path at any time and an uncertain variable T represent the time at which any walker on the path finishes their walk. Then, we have the following marginal ranges:
W = { { a } , { b } , { c } , { a , b } , { b , c } } ,
T = [ 5 : 00 , 5 : 30 ] .
We also have the following conditional ranges:
T | { a } = [ 5 : 00 , 5 : 15 ] ,
T | { b } = [ 5 : 10 , 5 : 30 ] ,
T | { c } = [ 5 : 20 , 5 : 30 ] ,
T | { a , b } = [ 5 : 10 , 5 : 15 ] ,
T | { b , c } = [ 5 : 20 , 5 : 30 ] .
For all t ∈ [5:00, 5:10), we have
W | t = { { a } } ;
for all t ∈ [5:10, 5:15], we have
W | t = { { a , b } , { a } , { b } } ;
for all t ∈ (5:15, 5:20), we have
W | t = { { b } } ;
and for all t ∈ [5:20, 5:30], we have
W | t = { { b , c } , { b } , { c } } .
Now, let the uncertainty function of a time set S be
m_T(S) = L(S) + 10 if S ≠ ∅, and m_T(S) = 0 otherwise,    (30)
where L ( · ) is the Lebesgue measure. Let the uncertainty function m W ( . ) associated with a set of individuals be the cardinality of the set. Then, the sets of association are
A ( W ; T ) = { 1 / 5 , 3 / 5 } ,
A ( T ; W ) = { 3 / 8 , 1 / 2 } .
It follows that for all δ 1 < 1 / 5 and δ 2 < 3 / 8 , we have
( W , T ) d ( δ 1 , δ 2 ) ,
and the residual uncertainty in W given T is at least a δ₁ fraction of the total uncertainty in W, while the residual uncertainty in T given W is at least a δ₂ fraction of the total uncertainty in T. On the other hand, for all δ₁ ≥ 3/5 and δ₂ ≥ 1/2, we have
( W , T ) a ( δ 1 , δ 2 ) ,
and the residual uncertainty in W given T is at most a δ 1 fraction of the total uncertainty in W, while the residual uncertainty in T given W is at most a δ 2 fraction of the total uncertainty in T.
Finally, if 1/5 ≤ δ₁ < 3/5 or 3/8 ≤ δ₂ < 1/2, then W and T are neither associated nor disassociated.
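The computations in Example 1 can be checked mechanically. The following Python sketch is an illustration of ours; the interval encoding in minutes after 5:00 and the dictionary keys are assumptions. It recovers the association sets A(W; T) = {1/5, 3/5} and A(T; W) = {3/8, 1/2}.

from fractions import Fraction as F

# Conditional ranges [[T|w]] from Example 1, as closed intervals in minutes after 5:00.
T_given_W = {"a": (0, 15), "b": (10, 30), "c": (20, 30), "ab": (10, 15), "bc": (20, 30)}
# [[W|t]] is piecewise constant in t; one representative set per piece suffices,
# and every piece contains more than one time instant.
W_given_T = [{"a"}, {"a", "b", "ab"}, {"b"}, {"b", "c", "bc"}]

def m_T(iv):                      # m_T(S) = L(S) + 10 for S nonempty, 0 otherwise
    return F(0) if iv is None else F(iv[1] - iv[0]) + 10

def m_W(s):                       # cardinality
    return F(len(s))

def intersect(u, v):              # intersection of two closed intervals (None if empty)
    lo, hi = max(u[0], v[0]), min(u[1], v[1])
    return (lo, hi) if lo <= hi else None

mT_total = m_T((0, 30))           # m_T([[T]]) = 40
mW_total = F(len(T_given_W))      # m_W([[W]]) = 5

# Distinct times may fall in the same piece of [[W|T]], so pairs with s1 == s2 are allowed here.
A_WT = {m_W(s1 & s2) / mW_total for s1 in W_given_T for s2 in W_given_T} - {F(0)}
# The elements of [[W]] are the five distinct walker sets, so here w1 != w2.
A_TW = {m_T(intersect(T_given_W[u], T_given_W[v])) / mT_total
        for u in T_given_W for v in T_given_W if u != v} - {F(0)}

print(sorted(A_WT))   # [Fraction(1, 5), Fraction(3, 5)]
print(sorted(A_TW))   # [Fraction(3, 8), Fraction(1, 2)]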

3.3. δ -Mutual Information

We now introduce the mutual information between uncertain variables in terms of some structural properties of covering sets. Intuitively, for any δ [ 0 , 1 ] the δ -mutual information, expressed in bits, represents the most refined knowledge that one uncertain variable provides about the other at a given level of confidence ( 1 δ ) . We express this idea by considering the quantization of the range of uncertainty of one variable, induced by the knowledge of the other. Such quantization ensures that the variable can be identified with uncertainty at most δ . The notions of association and disassociation introduced above are used to ensure that the mutual information is well defined, in that it can be positive and exhibits a certain symmetric property.
Definition 5. 
δ-Connectedness and δ-isolation.
  • For any δ [ 0 , 1 ] , points x 1 , x 2 X are δ-connected via X | Y and are denoted by x 1 δ x 2 if there exists a finite sequence { X | y i } i = 1 N of conditional sets such that x 1 X | y 1 , x 2 X | y N and for all 1 < i N , we have
    m_X(⟦X|y_i⟧ ∩ ⟦X|y_{i−1}⟧) / m_X(⟦X⟧) > δ.    (35)
    If x 1 δ x 2 and N = 1 , then we say that x 1 and x 2 are singly δ-connected via X | Y , i.e., there exists a y such that x 1 , x 2 X | y .
  • A set S X is δ-connected via X | Y if every pair of points in the set is δ-connected via X | Y .
  • A set S X is singly δ-connected via X | Y if there exists a y Y such that every point in the set is contained in X | y , namely S X | y .
  • Two sets S 1 , S 2 X are δ-isolated via X | Y if no point in S 1 is δ-connected to any point in S 2 .
Example 2. 
Consider the same setting discussed in Example 1. For δ = 2 / 8 , two points at times 5:05 and 5:25 T are δ-connected. The sequence of conditional sets connecting the two points is { T | { a } , T | { b } } , where the sets are defined in (21) and (22). This is because 5:05 T|{a}, 5:25 T|{b} and
m_T(⟦T|{a}⟧ ∩ ⟦T|{b}⟧) / m_T(⟦T⟧) = 3/8 > δ.    (36)
For all δ 0 , two points at times 5:00 and 5:05 T are singly δ-connected since 5:00, 5:05 T | { a } .
Likewise, for all δ 0 , the set S = { T | { a } } is singly δ-connected by definition.
For δ = 2 / 8 , the set S = { T | { a } , T | { b } } is δ-connected. This is because for all x 1 , x 2 S , one of the following scenarios holds:
  • x 1 , x 2 T | { a } .
  • x 1 , x 2 T | { b } .
  • x 1 T | { a } and x 2 T | { b } , or vice-versa.
In the first two scenarios, the points x 1 and x 2 are singly δ-connected. In the third scenario, the points are δ-connected since
m_T(⟦T|{a}⟧ ∩ ⟦T|{b}⟧) / m_T(⟦T⟧) = 3/8 > δ.    (37)
For all δ > 0 , the two sets T | { a } and T | { c } are δ-isolated since there is no overlap between the two sets.
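The δ-connectedness claims of Example 2 can also be verified directly. The sketch below is ours and self-contained; it reuses the same interval encoding as above and searches for a chain of conditional ranges whose consecutive overlaps exceed δ·m_T(⟦T⟧).

from collections import deque
from fractions import Fraction as F

# Conditional ranges [[T|w]] from Example 1 (minutes after 5:00): a, b, c, ab, bc.
ranges = [(0, 15), (10, 30), (20, 30), (10, 15), (20, 30)]
m_total = F(40)                   # m_T([[T]]) = 30 + 10

def m_T(iv):
    return F(0) if iv is None else F(iv[1] - iv[0]) + 10

def intersect(u, v):
    lo, hi = max(u[0], v[0]), min(u[1], v[1])
    return (lo, hi) if lo <= hi else None

def delta_connected(t1, t2, delta):
    """Breadth-first search for a chain of conditional ranges linking t1 to t2
    with consecutive overlap uncertainty strictly greater than delta * m_T([[T]])."""
    start = [i for i, iv in enumerate(ranges) if iv[0] <= t1 <= iv[1]]
    goal = {i for i, iv in enumerate(ranges) if iv[0] <= t2 <= iv[1]}
    seen, queue = set(start), deque(start)
    while queue:
        i = queue.popleft()
        if i in goal:
            return True
        for j in range(len(ranges)):
            if j not in seen and m_T(intersect(ranges[i], ranges[j])) > delta * m_total:
                seen.add(j)
                queue.append(j)
    return False

print(delta_connected(5, 25, F(2, 8)))   # True: the chain [[T|{a}]], [[T|{b}]] works for delta = 2/8
print(delta_connected(5, 25, F(3, 8)))   # False: no overlap strictly exceeds (3/8) * m_T([[T]])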
Definition 6. 
δ-overlap family.
For any δ [ 0 , 1 ] , a X | Y δ-overlap family of X , denoted by X | Y δ , is the largest family of distinct sets covering X , such that
  • Each set in the family is δ-connected and contains at least one singly δ-connected set of the form X | y .
  • The measure of overlap between any two distinct sets in the family is at most δ m_X(⟦X⟧); namely, for all S₁, S₂ ∈ ⟦X|Y⟧_δ such that S₁ ≠ S₂, we have m_X(S₁ ∩ S₂) ≤ δ m_X(⟦X⟧).
  • For every singly δ-connected set, there exists a set in the family containing it.
The first property of the δ -overlap family ensures that points in the same set of the family cannot be distinguished with confidence of at least ( 1 δ ) , while also ensuring that each set cannot be arbitrarily small. The second and third properties ensure that points that are not covered by the same set of the family can be distinguished with confidence of at least ( 1 δ ) . It follows that the cardinality of the covering family represents the most refined knowledge at a given level of confidence ( 1 δ ) that we can have about X, given the knowledge of Y. This also corresponds to the most refined quantization of the set X induced by Y. This interpretation is analogous to the one in [2], extending the concept of overlap partition introduced there to a δ -overlap family in this work. The stage is now set to introduce the δ -mutual information in terms of the δ -overlap family.
Definition 7. 
The δ-mutual information provided by Y about X is
I_δ(X; Y) = log₂ |⟦X|Y⟧_δ| bits,    (38)
if a X | Y δ-overlap family of X exists; otherwise, it is zero.
We now show that when variables are associated at level ( δ , δ 2 ) , there exists a δ -overlap family, so that the mutual information is well defined.
Theorem 1. 
If ( X , Y ) a ( δ , δ 2 ) , then there exists a δ-overlap family X | Y δ .
Proof. 
We show that
X | Y = { X | y : y Y }
satisfies all the three properties of δ -overlap family in Definition 6. First, note that X | Y is a cover of X , since X = y Y X | y , even though X | y for different y may overlap with each other. Second, each set in the family X | Y is singly δ -connected via X | Y , since trivially, any two points x 1 , x 2 X | y are singly δ -connected via the same set. It follows that Property 1 of Definition 6 holds.
Now, since ( X , Y ) a ( δ , δ 2 ) , then by Lemma 1, for all y 1 , y 2 Y , we have
m X ( X | y 1 X | y 2 ) m X ( X ) δ ,
which shows that Property 2 of Definition 6 holds. Finally, it is also easy to see that Property 3 of Definition 6 holds, since X | Y contains all sets X | y . Hence, X | Y satisfies all the three properties in Definition 6, which implies that there exists at least one set satisfying these conditions. Hence, the maximum over these sets is defined and the claim follows. □
Next, we show that a δ -overlap family also exists when variables are disassociated at level ( δ , δ 2 ) . In this case, we also characterize the mutual information in terms of a partition of X .
Definition 8. 
δ-isolated partition.
A X | Y δ-isolated partition of X , denoted by X | Y δ , is a partition of X such that any two sets in the partition are δ-isolated via X | Y .
Theorem 2. 
If ( X , Y ) d ( δ , δ 2 ) , then the following holds:
1. 
There exists a unique δ-overlap family X | Y δ .
2. 
The δ-overlap family is the δ-isolated partition of largest cardinality, in that, for any X | Y δ , we have
| X | Y δ | | X | Y δ | ,
where the equality holds if and only if X | Y δ = X | Y δ .
Proof. 
First, we show the existence of a δ -overlap family. For all x X , let C ( x ) be the set of points that are δ -connected to x via X | Y , namely
C ( x ) = { x 1 X : x δ x 1 } .
Then, we let
C = { C ( x ) : x X } ,
and show that this is a δ -overlap family. First, note that since X = S C S , we know that C is a cover of X . Second, for all C ( x ) C , there exists a y Y such that x X | y , and since any two points x 1 , x 2 X | y are singly δ -connected via X | Y , we understand that X | y C ( x ) . It follows that every set in the family C contains at least one singly δ -connected set. For all x 1 , x 2 C ( x ) , we also have x 1 δ x and x δ x 2 . Since ( X , Y ) d ( δ , δ 2 ) , by Lemma A2 in Appendix C, this implies that x 1 δ x 2 . It follows that every set in the family C is δ -connected and contains at least one singly δ -connected set, and we conclude that Property 1 of Definition 6 is satisfied.
We now claim that for all x 1 , x 2 X , if
C ( x 1 ) C ( x 2 ) ,
then
m X ( C ( x 1 ) C ( x 2 ) ) = 0 .
This can be proven by contradiction. Let C ( x 1 ) C ( x 2 ) and assume that m X ( C ( x 1 ) C ( x 2 ) ) 0 . By (9), this implies that C ( x 1 ) C ( x 2 ) . We can then select z C ( x 1 ) C ( x 2 ) , such that we have z δ x 1 and z δ x 2 . Since ( X , Y ) d ( δ , δ 2 ) , by Lemma A2 in Appendix C, this also implies that x 1 δ x 2 , and, therefore, C ( x 1 ) = C ( x 2 ) , which is a contradiction. It follows that if C ( x 1 ) C ( x 2 ) , then we must have m X ( C ( x 1 ) C ( x 2 ) ) = 0 , and, therefore,
m X ( C ( x 1 ) C ( x 2 ) ) m X ( X ) = 0 δ .
We conclude that Property 2 of Definition 6 is satisfied.
Finally, we observe that for any singly δ-connected set ⟦X|y⟧, there exists an x ∈ ⟦X⟧ such that x ∈ ⟦X|y⟧, which, by (42), implies that ⟦X|y⟧ ⊆ C(x). Namely, for every singly δ-connected set, there exists a set in the family containing it. We can then conclude that C satisfies all the properties of a δ-overlap family.
Next, we show that C is a unique δ -overlap family, which implies that this is also the largest set satisfying the three conditions in Definition 6. By contradiction, consider another δ -overlap family D . For all x X , let D ( x ) denote a set in D containing x. Then, using the definition of C ( x ) and the fact that D ( x ) is δ -connected, it follows that
D ( x ) C ( x ) .
Next, we show that for all x X , we also have
C ( x ) D ( x ) ,
from which, we conclude that D = C .
The proof of (48) is also obtained by contradiction. Assume there exists a point x ˜ C ( x ) D ( x ) . Since both x and x ˜ are contained in C ( x ) , x ˜ δ x . Let x be a point in a singly connected set that is contained in D ( x ) , namely x X | y D ( x ) . Since both x and x are in D ( x ) , we understand that x δ x . Since ( X , Y ) d ( δ , δ 2 ) , we can apply Lemma A2 in Appendix C to conclude that x ˜ δ x . It follows that there exists a sequence of conditional ranges { X | y i } i = 1 N such that x ˜ X | y 1 and x X | y N , which satisfies (35). Since x is in both X | y N and X | y , we obtain X | y N X | y , and since ( X , Y ) d ( δ , δ 2 ) , we obtain
m X ( X | y N X | y ) m X ( X ) > δ .
Without loss of generality, we can then assume that the last element of our sequence is X | y . By Property 3 of Definition 6, every conditional range in the sequence must be contained in some set of the δ -overlap family D . Since X | y D ( x ) and X | y 1 D ( x ) , it follows that there exist two consecutive conditional ranges along the sequence and two sets of the δ -overlap family covering them, such that X | y i 1 D ( x i 1 ) , X | y i D ( x i ) , and D ( x i 1 ) D ( x i ) . Then, we have
m X ( D ( x i 1 ) D ( x i ) ) = m X ( ( X | y i 1 X | y i ) ( D ( x i 1 ) D ( x i ) ) ) ( a ) m X ( X | y i 1 X | y i ) > ( b ) δ m X ( X ) ,
where ( a ) follows from (10) and ( b ) follows from (35). It follows that
m X ( D ( x i 1 ) D ( x i ) ) m X ( X ) > δ ,
and Property 2 of Definition 6 is violated. Thus, x̃ does not exist, which implies C(x) ⊆ D(x). Combining (47) and (48), we conclude that the δ-overlap family C is unique.
We now turn to the proof of the second part of the theorem. Since by (46), the uncertainty associated with the overlap between any two sets of the δ -overlap family C is zero, it follows that C is also a partition.
Now, we show that C is also a δ -isolated partition. This can be proven by contradiction. Assume that C is not a δ -isolated partition. Then, there exists two distinct sets C ( x 1 ) , C ( x 2 ) C such that C ( x 1 ) and C ( x 2 ) are not δ -isolated. This implies that there exists a point x ¯ 1 C ( x 1 ) and x ¯ 2 C ( x 2 ) such that x ¯ 1 δ x ¯ 2 . Using the fact that C ( x 1 ) and C ( x 2 ) are δ -connected and Lemma A2 in Appendix C, this implies that all points in the set C ( x 1 ) are δ -connected to all points in the set C ( x 2 ) . Now, let x 1 and x 2 be points in a singly δ -connected set contained in C ( x 1 ) and C ( x 2 ) , respectively: x 1 X | y 1 C ( x 1 ) and x 2 X | y 2 C ( x 2 ) . Since x 1 δ x 2 , there exists a sequence of conditional ranges { X | y i } i = 1 N satisfying (35), such that x 1 X | y 1 and x 2 X | y N . Without loss of generality, we can assume X | y 1 = X | y 1 and X | y 2 = X | y 2 . Since C is a partition, we understand that X | y 1 C ( x 1 ) and X | y 2 C ( x 1 ) . It follows that there exist two consecutive conditional ranges along the sequence { X | y i } i = 1 N and two sets of the δ -overlap family C covering them, such that X | y i 1 C ( x i 1 ) and X | y i C ( x i ) and that C ( x i 1 ) C ( x i ) . Similarly to (50), we hold that
m X ( C ( x i 1 ) C ( x i ) ) m X ( X ) > δ ,
and Property 2 of Definition 6 is violated. Thus, C ( x 1 ) and C ( x 2 ) do not exist, which implies that C is a δ -isolated partition.
Let P be any other δ -isolated partition. We wish to show that | C | | P | and that the equality holds if and only if P = C . First, note that every set C ( x ) C can intersect, at most, one set in P ; otherwise, the sets in P would not be δ -isolated. Second, since C is a cover of X , every set in P must be intersected by at least one set in C . It follows that
|C| ≥ |P|.    (53)
Now, assume the equality holds. In this case, there is a one-to-one correspondence P : C P , such that for all x X , we have C ( x ) P ( C ( x ) ) , and since both C and P are partitions of X , it follows that C = P . Conversely, assuming that C = P , then | C | = | P | follows trivially. □
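The construction used in the proof of Theorem 2 suggests a simple procedure for computing I_δ(X; Y) when the disassociation assumption holds: group the conditional ranges into δ-connected components and count them. The Python sketch below illustrates this for a finite toy example with a counting uncertainty function; the conditional ranges, the value of δ, and the assumption that the disassociation conditions of Theorem 2 hold are ours.

import math
from itertools import combinations

# Toy finite setting (assumed): [[X]] and the conditional ranges [[X|y]].
X_range = {0, 1, 2, 3, 4, 5}
cond_ranges = [{0, 1}, {1, 2}, {3, 4}, {4, 5}]
m = lambda S: float(len(S))            # counting uncertainty function m_X

def mutual_information(cond_ranges, X_range, delta):
    """log2 of the number of delta-connected components, following the construction
    in the proof of Theorem 2 (valid under the disassociation assumption)."""
    n = len(cond_ranges)
    parent = list(range(n))            # union-find over conditional ranges
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(n), 2):
        # Link two ranges when their overlap exceeds delta * m_X([[X]]), as in (35).
        if m(cond_ranges[i] & cond_ranges[j]) / m(X_range) > delta:
            parent[find(i)] = find(j)
    components = {}
    for i, S in enumerate(cond_ranges):  # each component is the union of its ranges
        components.setdefault(find(i), set()).update(S)
    return math.log2(len(components)), list(components.values())

I, family = mutual_information(cond_ranges, X_range, delta=0.1)
print(I, family)                       # e.g. 1.0 [{0, 1, 2}, {3, 4, 5}]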
We have introduced the notion of mutual information from Y to X in terms of the conditional range X | Y . Since, in general, we have X | Y Y | X , one may expect the definition of mutual information to be asymmetric in its arguments. Namely, the amount of information provided about X by the knowledge of Y may not be the same as the amount of information provided about Y by the knowledge of X. Although this is true in general, we show that for disassociated UVs, symmetry is retained, provided that when swapping X with Y, one also rescales δ appropriately. The following theorem establishes the symmetry in the mutual information under the appropriate scaling of the parameters δ 1 and δ 2 . The proof requires the introduction of the notions of taxicab connectedness, taxicab family, and taxicab partition, which are given in Appendix C.1, along with the proof of the theorem.
Theorem 3. 
If ( X , Y ) d ( δ 1 , δ 2 ) and a ( δ 1 , δ 2 ) -taxicab family of X , Y exists, then we have
I δ 1 ( X ; Y ) = I δ 2 ( Y ; X ) .

4. ( ϵ , δ )-Capacity

We now give a definition of the capacity of a communication channel and relate it to the notion of mutual information between the UVs introduced above. A normed space X is totally bounded if, for every ϵ > 0, X can be covered by a finite number of open balls of radius ϵ. We let X be a totally bounded, normed space such that for all x ∈ X, we have ‖x‖ ≤ 1, where ‖·‖ denotes the norm. This normalization is merely for notational convenience, and all results can easily be extended to metric spaces of any bounded norm. Let 𝒳 ⊆ X be a discrete set of points in the space, which represents a codebook.
Definition 9. 
ϵ-perturbation channel.
A channel is called ϵ-perturbation if for any transmitted codeword x 𝒳 , x is received with noise perturbation at most ϵ. Namely, we receive a point in the set
S_ϵ(x) = { y ∈ X : ‖x − y‖ ≤ ϵ }.    (55)
When the codebook 𝒳 is transmitted over an ϵ-perturbation channel, all received codewords lie in the set 𝒴 = ⋃_{x ∈ 𝒳} S_ϵ(x), where 𝒴 ⊆ Y = X. Transmitted codewords can be decoded correctly as long as the corresponding uncertainty sets at the receiver do not overlap. This can be achieved by simply associating the received codeword with the point in the codebook that is closest to it.
For any x 1 , x 2 𝒳 , we now let
e_ϵ(x₁, x₂) = m_Y(S_ϵ(x₁) ∩ S_ϵ(x₂)) / m_Y(Y),    (56)
where m Y ( . ) is an uncertainty function defined over the space Y . We also assume without loss of generality that the uncertainty associated with the whole space Y of received codewords is m Y ( Y ) = 1 . Finally, we let V ϵ Y be the smallest uncertainty set corresponding to a transmitted codeword, namely V ϵ = S ϵ ( x ) , where x = argmin x X m Y ( S ϵ ( x ) ) . The quantity 1 e ϵ ( x 1 , x 2 ) can be viewed as the confidence we have in not confusing x 1 and x 2 in any transmission or, equivalently, as the amount of adversarial effort required to induce a confusion between the two codewords. For example, if the uncertainty function is constructed using a measure, then all the erroneous codewords generated by an adversary to decode x 2 instead of x 1 must lie inside the equivocation set depicted in Figure 3, whose relative size is given by (56). The smaller the equivocation set is, the larger the effort required by the adversary to induce an error must be. If the uncertainty function represents the diameter of the set, then all the erroneous codewords generated by an adversary to decode x 2 instead of x 1 will be close to each other in the sense of (56). Once again, the closer the possible erroneous codewords are, the harder it must be for the adversary to generate an error, since any small deviation allows the decoder to correctly identify the transmitted codeword.
We now introduce the notion of a distinguishable codebook, ensuring that every codeword cannot be confused with any other codeword, rather than with a specific one, at a given level of confidence.
Definition 10. 
( ϵ , δ ) -distinguishable codebook.
For any 0 < ϵ ≤ 1 and 0 ≤ δ < m_Y(V_ϵ), a codebook 𝒳 ⊆ X is (ϵ, δ)-distinguishable if for all x₁, x₂ ∈ 𝒳, we have e_ϵ(x₁, x₂) ≤ δ/|𝒳|.
For any ( ϵ , δ ) -distinguishable codebook 𝒳 and x 𝒳 , we let
e_ϵ(x) = Σ_{x′ ∈ 𝒳 : x′ ≠ x} e_ϵ(x, x′).    (57)
It now follows from Definition 10 that
e_ϵ(x) ≤ δ,    (58)
and each codeword in an ( ϵ , δ ) -distinguishable codebook can be decoded correctly with confidence of at least 1 δ . Definition 10 guarantees even more, namely that the confidence of not confusing any pair of codewords is uniformly bounded by 1 δ / | 𝒳 | . This stronger constraint implies that we cannot “balance” the error associated with a codeword transmission by allowing some decoding pair to have a lower confidence and enforcing other pairs to have higher confidence. This is the main difference between our definition and the one used in [7], which bounds the average confidence and allows us to relate the notion of capacity to the mutual information between pairs of codewords.
Definition 11. 
( ϵ , δ ) -capacity.
For any totally bounded, normed metric space X, 0 < ϵ ≤ 1, and 0 ≤ δ < m_Y(V_ϵ), the (ϵ, δ)-capacity of X is
C_ϵ^δ = sup_{𝒳 ∈ X_ϵ^δ} log₂ |𝒳| bits,    (59)
where X_ϵ^δ = { 𝒳 : 𝒳 is (ϵ, δ)-distinguishable } is the set of (ϵ, δ)-distinguishable codebooks.
The ( ϵ , δ ) -capacity represents the largest number of bits that can be communicated by using any ( ϵ , δ ) -distinguishable codebook. The corresponding geometric picture is illustrated in Figure 4. For δ = 0 , our notion of capacity reduces to Kolmogorov’s ϵ -capacity, which is the logarithm of the packing number of the space with balls of radius ϵ .
In the definition of capacity, we have restricted δ < m Y ( V ϵ ) to rule out cases when the decoding error can be at least as large as the error introduced by the channel and when the ( ϵ , δ ) -capacity is infinite. Also, note that m Y ( V ϵ ) 1 since V ϵ Y and (10) holds.
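As a concrete and purely illustrative instance of Definitions 10 and 11, the following Python sketch brute-forces the largest (ϵ, δ)-distinguishable codebook on the interval X = [−1, 1] with Lebesgue uncertainty normalized so that m_Y(Y) = 1. The candidate codeword locations, ϵ, and δ are our own assumptions, chosen so that δ < m_Y(V_ϵ).

from itertools import combinations
from math import log2

eps, delta = 0.25, 0.1
# m_Y(V_eps) = eps/2 = 0.125 here (the smallest ball sits at the boundary of [-1, 1]),
# so the assumed delta = 0.1 respects the constraint delta < m_Y(V_eps).
candidates = [-0.95, -0.475, 0.0, 0.475, 0.95]   # hypothetical codeword locations in [-1, 1]

def e_eps(x1, x2):
    """Normalized overlap m_Y(S_eps(x1) ∩ S_eps(x2)) / m_Y(Y), with m_Y = length / 2 on [-1, 1].
    The closed form is exact for these candidates because their overlap regions lie in the interior."""
    return max(0.0, 2 * eps - abs(x1 - x2)) / 2.0

def is_distinguishable(code):
    bound = delta / len(code)          # pairwise requirement of Definition 10
    return all(e_eps(a, b) <= bound for a, b in combinations(code, 2))

best = max((c for r in range(1, len(candidates) + 1)
            for c in combinations(candidates, r)
            if is_distinguishable(c)), key=len)
print(best, log2(len(best)))           # (-0.95, -0.475, 0.0, 0.475, 0.95) 2.3219...
# With delta = 0 (no overlap allowed), only three of these points, e.g. {-0.95, 0.0, 0.95},
# would qualify, so tolerating overlap raises the achievable rate from log2(3) to log2(5) here.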
We now relate our operational definition of capacity to the notion of UVs and mutual information introduced in Section 3. Let X be the UV corresponding to the transmitted codeword. This is a map X : X 𝒳 and X = 𝒳 X . Likewise, let Y be the UV corresponding to the received codeword. This is a map Y : Y 𝒴 and Y = 𝒴 Y . For our ϵ -perturbation channel, these UVs are such that for all y Y and x X , we have
⟦Y|x⟧ = { y ∈ Y : ‖x − y‖ ≤ ϵ },    (60)
⟦X|y⟧ = { x ∈ ⟦X⟧ : ‖x − y‖ ≤ ϵ },    (61)
(see Figure 5). Clearly, the set in (60) is continuous, while the set in (61) is discrete.
To measure the levels of association and disassociation between X and Y, we use an uncertainty function m X ( . ) defined over X and m Y ( . ) defined over Y . We introduce the feasible set
F_δ = { X : ⟦X⟧ ⊆ X, and either (X, Y) d (0, δ/|⟦X⟧|) or (X, Y) a (1, δ/|⟦X⟧|) },    (62)
representing the set of UVs X such that the marginal range X is a discrete set representing a codebook, and the UV can either achieve ( 0 , δ / | X | ) levels of disassociation or ( 1 , δ / | X | ) levels of association with Y. In our channel model, this feasible set also depends on the ϵ -perturbation through (60) and (61).
We can now state the non-stochastic channel coding theorem for our ϵ -perturbation channel.
Theorem 4. 
For any totally bounded, normed metric space X, ϵ-perturbation channel satisfying (60) and (61), 0 < ϵ ≤ 1, and 0 ≤ δ < m_Y(V_ϵ), we have
C_ϵ^δ = sup_{X ∈ F_δ̃, δ̃ ≤ δ/m_Y(⟦Y⟧)} I_{δ̃/|⟦X⟧|}(Y; X) bits.    (63)
Proof. 
First, we show that there exists a UV X and δ ˜ δ / m Y ( Y ) such that X F δ ˜ , which implies that the supremum is well defined. Second, for all X and δ ˜ such that
X F δ ˜ ,
and
δ ˜ δ / m Y ( Y ) ,
we show that
I δ ˜ / | X | ( Y ; X ) C ϵ δ .
Finally, we show the existence of X F δ ˜ and δ ˜ δ / m Y ( Y ) such that I δ ˜ / | X | ( Y ; X ) = C ϵ δ .
Let us begin with the first step. Consider a point x X . Let X be a UV such that
X = { x } .
Then, we hold that the marginal range of the UV Y corresponding to the received variable is
Y = Y | x ,
and, therefore, for all y Y , we have
X | y = { x } .
Using Definition 2 and (67), we hold that
A ( Y ; X ) = ,
because X consists of a single point, and, therefore, the set in (12) is empty.
On the other hand, using Definition 2 and (69), we have
A(X; Y) = {1} if there exist y₁, y₂ ∈ ⟦Y⟧ with y₁ ≠ y₂, and A(X; Y) = ∅ otherwise.    (71)
Using (70), and since A δ holds for A = , we have
A ( Y ; X ) δ / ( | X | m Y ( Y ) ) .
Similarly, using (71), we have
A ( X ; Y ) 1 .
Now, combining (72) and (73), we have
( X , Y ) a ( 1 , δ / ( | X | m Y ( Y ) ) ) .
Letting δ ˜ = δ / m Y ( Y ) , this implies that X F δ ˜ and the first step of the proof is complete.
To prove the second step, we define the set of discrete UVs
G = { X : X X , δ ˜ δ / m Y ( Y ) such   that   S 1 , S 2 Y | X , m Y ( S 1 S 2 ) / m Y ( Y ) δ ˜ / | X | } ,
which is a larger set than the one containing all UVs X that are ( 1 , δ ˜ / | X | ) associated with Y. Now, we will show that if a UV X G , then the corresponding codebook 𝒳 X ϵ δ . If X G , then there exists a δ ˜ δ / m Y ( Y ) such that for all S 1 , S 2 Y | X , we have
m Y ( S 1 S 2 ) m Y ( Y ) δ ˜ | X | .
It follows that for all x 1 , x 2 X , we have
m Y ( Y | x 1 Y | x 2 ) m Y ( Y ) δ ˜ | X | .
Using 𝒳 = X , (60), Y = 𝒴 = x 𝒳 S ϵ ( x ) and m Y ( Y ) = 1 , for all x 1 , x 2 𝒳 , we have
m Y ( S ϵ ( x 1 ) S ϵ ( x 2 ) ) m Y ( Y ) δ ˜ m Y ( Y ) | 𝒳 | , ( a ) δ | 𝒳 | ,
where ( a ) follows from δ ˜ δ / m Y ( Y ) . Putting things together, it follows that
X G 𝒳 X ϵ δ
Consider now a pair of X and δ ˜ such that δ ˜ δ / m Y ( Y ) and
X F δ ˜ .
If ( X , Y ) d ( 0 , δ ˜ / | X | ) , then, using Lemma A1 in Appendix C, there exist two UVs, X ¯ and Y ¯ and δ ¯ δ / m Y ( Y ¯ ) , such that
( X ¯ , Y ¯ ) a ( 1 , δ ¯ / | X ¯ | ) ,
and
| Y | X δ ˜ / | X | | = | Y ¯ | X ¯ δ ¯ / | X ¯ | | .
On the other hand, if ( X , Y ) a ( 1 , δ ˜ / | X | ) , then (81) and (82) also trivially hold. It then follows that (81) and (82) hold for all X F δ ˜ . We now have
I δ ˜ / | X | ( Y ; X ) = log ( | Y | X δ ˜ / | X | | ) = ( a ) log ( | Y ¯ | X ¯ δ ¯ / | X ¯ | | ) ( b ) log ( | X ¯ | ) , = ( c ) log ( | 𝒳 ¯ | ) , ( d ) C ϵ δ ,
where ( a ) follows from (81) and (82), ( b ) follows from Lemma A3 in Appendix C since δ ¯ δ / m Y ( Y ¯ ) < m Y ( V ϵ ) / m Y ( Y ¯ ) , ( c ) follows by defining the codebook 𝒳 ¯ corresponding to the UV X ¯ , and ( d ) follows from the fact that using (81) and Lemma 1 allows X ¯ G , which implies for (79) that 𝒳 ¯ X ϵ δ .
Finally, let
X = a r g s u p 𝒳 X ϵ δ log ( | 𝒳 | ) ,
which achieves the capacity C_ϵ^δ. Let X be the UV whose marginal range corresponds to this codebook. It follows that for all S₁, S₂ ∈ ⟦Y|X⟧, we have
m_Y(S₁ ∩ S₂) / m_Y(Y) ≤ δ / |⟦X⟧|,    (85)
which implies, since m_Y(Y) = 1, that
m_Y(S₁ ∩ S₂) / m_Y(⟦Y⟧) ≤ δ / (|⟦X⟧| m_Y(⟦Y⟧)).    (86)
Letting δ̃ = δ/m_Y(⟦Y⟧) and using Lemma 1, we obtain (X, Y) a (1, δ̃/|⟦X⟧|), which implies that X ∈ ⋃_{δ̃ ≤ δ/m_Y(⟦Y⟧)} F_δ̃, and the proof is complete. □
Theorem 4 characterizes the capacity as the supremum of the mutual information over all UVs in the feasible set. The following theorem shows that the same characterization is obtained if we optimize the right-hand side in (63) over all UVs in the space. It follows by Theorem 4 that rather than optimizing over all UVs representing all the codebooks in the space, a capacity-achieving codebook can be found within the smaller class δ ˜ δ / m Y ( V ϵ ) F δ ˜ of feasible sets with error at most δ / m Y ( V ϵ ) , since for all Y Y , m Y ( V ϵ ) m Y ( Y ) .
Theorem 5. 
The ( ϵ , δ ) -capacity in (63) can also be written as
C_ϵ^δ = sup_{X : ⟦X⟧ ⊆ X, δ̃ ≤ δ/m_Y(⟦Y⟧)} I_{δ̃/|⟦X⟧|}(Y; X) bits.    (87)
Proof. 
Consider a UV X δ ˜ δ / m Y ( Y ) F δ ˜ , where Y is the corresponding UV at the receiver. The idea of the proof is to show the existence of a UV X ¯ δ ˜ δ / m Y ( Y ¯ ) F δ ˜ and the corresponding UV Y ¯ at the receiver, and
δ ¯ = δ ˜ m Y ( Y ) / m Y ( Y ¯ ) δ / m ( Y ¯ ) ,
such that the cardinality of the overlap partitions
| Y ¯ | X ¯ δ ¯ / | X ¯ | | = | Y | X δ ˜ / | X | | .
Let the cardinality
| Y | X δ ˜ / | X | | = K .
By Property 1 of Definition 6, we hold that for all S i Y | X δ ˜ / | X | , there exists an x i X such that Y | x i S i . Now, consider another UV X ¯ whose marginal range is composed of K elements of X , namely
X ¯ = { x 1 , x K } .
Let Y ¯ be the UV corresponding to the received variable. Using the fact that for all x X , we have Y ¯ | x = Y | x since (60) holds, and using Property 2 of Definition 6, for all x , x X ¯ , we obtain
m Y ( Y ¯ | x Y ¯ | x ) m Y ( Y ) δ ˜ | X | , ( a ) δ ˜ | X ¯ | ,
where ( a ) follows from the fact that X ¯ X using (91). Then, for all x , x X ¯ , we hold that
m Y ( Y ¯ | x Y ¯ | x ) m Y ( Y ¯ ) δ ˜ m Y ( Y ) | X ¯ | m Y ( Y ¯ ) = δ ¯ | X ¯ | ,
since δ ¯ = δ ˜ m Y ( Y ) / m Y ( Y ¯ ) . Then, by Lemma 1, it follows that
( X ¯ , Y ¯ ) a ( 1 , δ ¯ / | X ¯ | ) .
Since δ ˜ δ / m Y ( Y ) , we have
δ ¯ δ / m Y ( Y ¯ ) < m Y ( V ϵ ) / m Y ( Y ¯ ) .
Therefore, X ¯ F δ ¯ and δ ¯ δ / m Y ( Y ¯ ) . We now hold that
| Y ¯ | X ¯ δ ¯ / | X ¯ | | = ( a ) | X ¯ | = ( b ) | Y | X δ ˜ / | X | | ,
where ( a ) follows by applying Lemma A4 in Appendix C using (94) and (95) and ( b ) follows from (90) and (91). Combining (96) with Theorem 4, the proof is complete. □
We now make some considerations with respect to previous results in the literature. First, we note that for δ = 0 , all of our definitions reduce to Nair’s ones and Theorem 4 recovers Nair’s coding theorem ([2] (Theorem 4.1)) for the zero-error capacity of an additive ϵ -perturbation channel.
Second, we point out that the ( ϵ , δ ) -capacity considered in [7] defines the set of ( ϵ , δ ) -distinguishable codewords such that the average overlap among all codewords is at most δ . In contrast, our definition requires the overlap for each pair of codewords to be at most δ / | 𝒳 | . The following theorem provides the relationship between our C ϵ δ and the capacity C ˜ ϵ δ considered in [7], which is defined using the Euclidean norm.
Theorem 6. 
Let C̃_ϵ^δ be the (ϵ, δ)-capacity defined in [7]. We have
C_ϵ^δ ≤ C̃_ϵ^(δ/(2 m_Y(V_ϵ))),    (97)
and
C̃_ϵ^δ ≤ C_ϵ^(δ m_Y(V_ϵ) 2^(2 C̃_ϵ^δ + 1)).    (98)
Proof. 
For every codebook 𝒳 X ϵ δ and x 1 , x 2 𝒳 , we have
e ϵ ( x 1 , x 2 ) δ / | 𝒳 | .
Since m Y ( Y ) = 1 , this implies that for all x 1 , x 2 𝒳 , we have
m Y ( S ϵ ( x 1 ) S ϵ ( x 2 ) ) δ / | 𝒳 | .
For all 𝒳 X , the average overlap defined in ([7] (53)) is
Δ = 1 | 𝒳 | x 𝒳 e ϵ ( x ) 2 m Y ( V ϵ ) .
Then, we have
Δ = 1 2 | 𝒳 | m Y ( V ϵ ) x 1 , x 2 𝒳 m Y ( S ϵ ( x 1 ) S ϵ ( x 2 ) ) , ( a ) δ | 𝒳 | 2 2 | 𝒳 | 2 m Y ( V ϵ ) , δ 2 m Y ( V ϵ ) ,
where ( a ) follows from (100). Thus, we have
C_ϵ^δ ≤ C̃_ϵ^(δ/(2 m_Y(V_ϵ))),    (103)
and (97) follows.
Now, let 𝒳 be a codebook with average overlap at most δ , namely
1 2 | 𝒳 | m Y ( V ϵ ) x 1 , x 2 𝒳 m Y ( S ϵ ( x 1 ) S ϵ ( x 2 ) ) δ .
This implies that for all x 1 , x 2 𝒳 , we have
| 𝒳 | m Y ( S ϵ ( x 1 ) S ϵ ( x 2 ) ) m Y ( Y ) 2 δ | 𝒳 | 2 m Y ( V ϵ ) m Y ( Y ) , = ( a ) 2 δ | 𝒳 | 2 m Y ( V ϵ ) , δ 2 2 C ˜ ϵ δ + 1 m Y ( V ϵ ) ,
where ( a ) follows from the fact that m Y ( Y ) = 1 . Thus, we have
C̃_ϵ^δ ≤ C_ϵ^(δ 2^(2 C̃_ϵ^δ + 1) m_Y(V_ϵ)),    (106)
and (98) follows. □
To better understand the relationship between the two capacities and show how they can be distinct, consider the case in which the output space is the union of the three ϵ -balls depicted in Figure 6; this is the only feasible output configuration.
We now compute the two capacities C_ϵ^δ and C̃_ϵ^δ in this case. Denoting by δ′ the measure of the overlap between the two intersecting balls in Figure 6, we have
m_Y(S_ϵ(x₁) ∩ S_ϵ(x₂)) = δ′,    (107)
and the average overlap (101) is
Δ = (1/3)(δ′/2) = δ′/6.    (108)
It follows that
C̃_ϵ^δ = log₂ 3 if δ ≥ δ′/6, and C̃_ϵ^δ = log₂ 2 otherwise.    (109)
On the other hand, the worst-case overlap is
m_Y(S_ϵ(x₁) ∩ S_ϵ(x₂)) = δ′ = 3δ′/|𝒳|,    (110)
and it follows that
C_ϵ^δ = log₂ 3 if δ ≥ 3δ′, and C_ϵ^δ = log₂ 2 otherwise.    (111)

5. ( N , δ ) -Capacity of General Channels

We now extend our results to more general channels where the noise can be different across codewords and is not necessarily contained within a ball of radius ϵ .
Let 𝒳 X be a discrete set of points in the space, which represents a codebook. Any point x 𝒳 represents a codeword that can be selected at the transmitter, sent over the channel, and received with perturbation. A channel with transition mapping N : X Y associates with any point in X a set in Y , such that the received codeword lies in the set
S_N(x) = { y ∈ Y : y ∈ N(x) }.    (112)
Figure 7 illustrates possible uncertainty sets associated with three different codewords.
All received codewords lie in the set 𝒴 = ⋃_{x ∈ 𝒳} S_N(x), where 𝒴 ⊆ Y. For any x₁, x₂ ∈ X, we now let
e_N(x₁, x₂) = m_Y(S_N(x₁) ∩ S_N(x₂)) / m_Y(Y),    (113)
where m Y ( . ) is an uncertainty function defined over Y . We also assume without loss of generality that the uncertainty associated with the space Y of received codewords is m Y ( Y ) = 1 . We also let V N = N ( x ) , where x = argmin x X m Y ( N ( x ) ) . Thus, V N is the set corresponding to the minimum uncertainty introduced by the noise mapping N.
Definition 12. 
( N , δ ) -distinguishable codebook.
For any 0 ≤ δ < m_Y(V_N), a codebook 𝒳 ⊆ X is (N, δ)-distinguishable if for all x₁, x₂ ∈ 𝒳, we have e_N(x₁, x₂) ≤ δ/|𝒳|.
Definition 13. 
( N , δ ) -capacity.
For any totally bounded, normed metric space X, channel with transition mapping N, and 0 ≤ δ < m_Y(V_N), the (N, δ)-capacity of X is
C_N^δ = sup_{𝒳 ∈ X_N^δ} log₂ |𝒳| bits,    (114)
where X_N^δ = { 𝒳 : 𝒳 is (N, δ)-distinguishable }.
We now relate our definition of capacity to the notion of UVs and mutual information introduced in Section 3. As usual, let X be the UV corresponding to the transmitted codeword and Y be the UV corresponding to the received codeword. For a channel with transition mapping N, these UVs are such that for all y Y and x X , we have
⟦Y|x⟧ = { y ∈ Y : y ∈ N(x) },    (115)
⟦X|y⟧ = { x ∈ ⟦X⟧ : y ∈ N(x) }.    (116)
To measure the levels of association and disassociation between UVs X and Y, we use an uncertainty function m X ( . ) defined over X , and m Y ( . ) is defined over Y . The definition of the feasible set is the same as the one given in (62). In our channel model, this feasible set depends on the transition mapping N through (115) and (116).
We can now state the non-stochastic channel coding theorem for channels with transition mapping N.
Theorem 7. 
For any totally bounded, normed metric space X, channel with transition mapping N satisfying (115) and (116), and 0 ≤ δ < m_Y(V_N), we have
C_N^δ = sup_{X ∈ F_δ̃, δ̃ ≤ δ/m_Y(⟦Y⟧)} I_{δ̃/|⟦X⟧|}(Y; X) bits.    (117)
The proof is along the same lines as the one of Theorem 4 and is omitted.
Theorem 7 characterizes the capacity as the supremum of the mutual information over all codebooks in the feasible set. The following theorem shows that the same characterization is obtained if we optimize the right hand side in (117) over all codebooks in the space. It follows by Theorem 7 that rather than optimizing over all codebooks, a capacity-achieving codebook can be found within the smaller class δ ˜ δ / m Y ( V N ) F δ ˜ of feasible sets with error at most δ / m Y ( V N ) .
Theorem 8. 
The ( N , δ ) -capacity in (117) can also be written as
C_N^δ = sup_{X : ⟦X⟧ ⊆ X, δ̃ ≤ δ/m_Y(⟦Y⟧)} I_{δ̃/|⟦X⟧|}(Y; X) bits.    (118)
The proof is along the same lines as the one of Theorem 5 and is omitted.

6. Capacity of Stationary Memoryless Uncertain Channels

In this section, we consider the special case of stationary, memoryless, uncertain channels.
Let X be the space of X -valued discrete-time functions x : Z > 0 X , where Z > 0 is the set of positive integers denoting the time step. Let x ( a : b ) denote the function x X restricted over the time interval [ a , b ] . Let 𝒳 X be a discrete set which represents a codebook. Also, let 𝒳 ( 1 : n ) = x 𝒳 x ( 1 : n ) denote the set of all codewords up to time n and 𝒳 ( n ) = x 𝒳 x ( n ) denote the set of all codeword symbols in the codebook at time n. The codeword symbols can be viewed as the coefficients representing a continuous signal in an infinite-dimensional space. For example, transmitting one symbol per time step can be viewed as transmitting a signal of unit spectral support over time. Any discrete-time function x 𝒳 can be selected at the transmitter, sent over a channel, received with noise perturbation, and introduced by the channel. The perturbation of the signal at the receiver due to the noise can be described as a displacement experienced by the corresponding codeword symbols x ( 1 ) , x ( 2 ) , . To describe this perturbation, we consider the set-valued map N : X 2 Y , associating any point in X to a set in Y , where Y is the space of Y -values discrete-time functions. For any transmitted codeword x 𝒳 X , the corresponding received codeword lies in the set
S_N(x) = { y ∈ Y : y ∈ N(x) }.    (119)
Also, the noise set associated with x ( 1 : n ) 𝒳 ( 1 : n ) is
S_N(x(1 : n)) = { y(1 : n) ∈ Yⁿ : y ∈ N(x) },    (120)
where Yⁿ = Y × Y × ⋯ × Y (n times). We are now ready to define stationary, memoryless, uncertain channels.
Definition 14. 
A stationary, memoryless, uncertain channel is a transition mapping N : X 2 Y that can be factorized into identical terms describing the noise experienced by the codeword symbols. Namely, there exists a set-valued map N : X Y such that for all n Z > 0 and x ( 1 : n ) X , we have
S_N(x(1 : n)) = N(x(1)) × ⋯ × N(x(n)).    (121)
According to this definition, a stationary, memoryless, uncertain channel maps the nth input symbol into the nth output symbol in a way that does not depend on the symbols at other time steps, and the mapping is the same at all time steps. Since the channel is fully characterized by the per-symbol mapping N, to simplify the notation, we will write S_N(·) for the uncertainty set of a codeword under the full transition mapping as well.
Another important observation is that the ϵ -perturbation channel in Definition 9 may not admit a factorization like the one in (121). For example, consider the space to be equipped with the L 2 norm, the codeword symbols to represent the coefficients of an orthogonal representation of a transmitted signal, and the noise experienced by any codeword to be within a ball of radius ϵ . In this case, if a codeword symbol is perturbed by a value close to ϵ , the perturbation of all other symbols must be close to zero.
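To illustrate the factorization (121), the short Python sketch below enumerates the uncertainty set of a length-three codeword for a toy stationary, memoryless, uncertain channel over a finite alphabet; the alphabet and the per-symbol noise map N are illustrative assumptions.

from itertools import product

alphabet = [0, 1, 2, 3]

def N(symbol):
    """Per-symbol uncertainty set N(x): each symbol may be received as itself
    or as the next symbol (modulo the alphabet size)."""
    return {symbol, (symbol + 1) % len(alphabet)}

def S_N(codeword):
    """S_N(x(1:n)) = N(x(1)) x ... x N(x(n)) for a stationary, memoryless channel."""
    return set(product(*(N(s) for s in codeword)))

x = (0, 2, 3)
print(sorted(S_N(x)))   # the 8 sequences in {0, 1} x {2, 3} x {3, 0}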
For stationary, memoryless, uncertain channels, all received codewords lie in the set 𝒴 = ⋃_{x ∈ 𝒳} S_N(x), and the received codewords up to time n lie in the set 𝒴(1 : n) = ⋃_{x ∈ 𝒳} S_N(x(1 : n)). Then, for any x₁(1 : n), x₂(1 : n) ∈ 𝒳(1 : n), we let
e_N(x₁(1 : n), x₂(1 : n)) = m_Y(S_N(x₁(1 : n)) ∩ S_N(x₂(1 : n))) / m_Y(Yⁿ),    (122)
where m Y ( . ) is an uncertainty function defined over the space of the received codewords. We also assume without loss of generality that at any time step n, the uncertainty associated with the space Y n of received codewords is m Y ( Y n ) = 1 . We also let V N = N ( x ) , where x = argmin x X m Y ( N ( x ) ) . Thus, V N is the set corresponding to the minimum uncertainty introduced by the noise mapping at a single time step. Finally, we let V N n = V N × V N × × V N n . The quantity 1 e ϵ ( x 1 ( 1 : n ) , x 2 ( 1 : n ) ) can be viewed as the confidence we have of not confusing x 1 ( 1 : n ) and x 2 ( 1 : n ) in any transmission or, equivalently, as the amount of adversarial effort required to induce a confusion between the two codewords. For example, if the uncertainty function is constructed using a measure, then all the erroneous codewords generated by an adversary to decode x 2 ( 1 : n ) instead of x 1 ( 1 : n ) must lie inside the equivocation set S N ( x 1 ( 1 : n ) ) S N ( x 2 ( 1 : n ) ) whose relative size is given by (122). The smaller the equivocation set is, the larger the effort required by the adversary to induce an error must be. If the uncertainty function represents the diameter of the set, then all the erroneous codewords generated by an adversary to decode x 2 ( 1 : n ) instead of x 1 ( 1 : n ) will be close to each other, in the sense of (122).
We now introduce the notion of a distinguishable codebook, ensuring that every codeword cannot be confused with any other codeword, rather than with a specific one, at a given level of confidence.
Definition 15. 
( N , δ n ) -distinguishable codebook.
For all n ∈ ℤ_{>0} and 0 ≤ δₙ < m_Y(V_Nⁿ), a codebook 𝒳ₙ = 𝒳(1 : n) is (N, δₙ)-distinguishable if for all x₁(1 : n), x₂(1 : n) ∈ 𝒳ₙ, we have
e_N(x₁(1 : n), x₂(1 : n)) ≤ δₙ / |𝒳ₙ|.    (123)
It immediately follows that for any ( N , δ n ) -distinguishable codebook 𝒳 n , we have
e_N(x(1 : n)) = Σ_{x′(1 : n) ∈ 𝒳ₙ : x′(1 : n) ≠ x(1 : n)} e_N(x(1 : n), x′(1 : n)) ≤ δₙ,    (124)
so that each codeword in 𝒳ₙ can be decoded correctly with confidence of at least 1 − δₙ. Definition 15 guarantees even more, namely that the confidence of not confusing any pair of codewords is at least 1 − δₙ/|𝒳ₙ|.
We now associate with any sequence { δ n } the largest distinguishable rate sequence { R δ n } , whose elements represent the largest rates that satisfy that confidence sequence.
Definition 16. 
Largest { δ n } -distinguishable rate sequence.
For any sequence {δₙ}, the largest {δₙ}-distinguishable rate sequence {R_δₙ} is such that for all n ∈ ℤ_{>0}, we have
R_δₙ = sup_{𝒳ₙ ∈ X_N^δₙ(n)} log₂ |𝒳ₙ| / n bits per symbol,    (125)
where X_N^δₙ(n) = { 𝒳ₙ : 𝒳ₙ is (N, δₙ)-distinguishable }.
We say that any constant rate R that lies below the largest {δₙ}-distinguishable rate sequence is {δₙ}-distinguishable. Such a {δₙ}-distinguishable rate ensures the existence of a sequence of distinguishable codes that, for all n ∈ ℤ_{>0}, have a rate of at least R and confidence of at least 1 − δₙ.
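As a purely illustrative instance of Definition 16, the Python sketch below brute-forces the largest {δₙ}-distinguishable rate R_δₙ for a small assumed stationary, memoryless, uncertain channel, using a weighted counting measure as the uncertainty function. The alphabets, noise sets, weights, and the values of δₙ (chosen so that δₙ < m_Y(V_Nⁿ) = 0.35ⁿ) are all our own assumptions; exact rational arithmetic avoids spurious boundary effects.

from fractions import Fraction as F
from itertools import product, combinations
from math import log2, prod

X_alpha = [0, 1, 2]                                  # input alphabet
N = {0: {0, 1}, 1: {1, 2, 3}, 2: {3, 4, 5}}          # per-symbol noise sets N(x)
w = [F(w_, 100) for w_ in (30, 5, 30, 5, 20, 10)]    # m_Y weights on the output alphabet, summing to 1

def overlap(x1, x2):
    """m_Y(S_N(x1(1:n)) ∩ S_N(x2(1:n))) under the product (memoryless) measure."""
    return prod(sum(w[y] for y in N[a] & N[b]) for a, b in zip(x1, x2))

def R(n, delta_n):
    """Largest rate (bits per symbol) of an (N, delta_n)-distinguishable codebook of length n."""
    words = list(product(X_alpha, repeat=n))
    best = 1                                         # a single codeword is always distinguishable
    for size in range(2, len(words) + 1):
        if any(all(overlap(a, b) <= delta_n / size
                   for a, b in combinations(code, 2))
               for code in combinations(words, size)):
            best = size
    return log2(best) / n

print(R(1, F(0)), R(1, F(15, 100)))   # 1.0 and log2(3) ≈ 1.585: overlap tolerance helps at n = 1
print(R(2, F(0)), R(2, F(12, 100)))   # 1.0 and log2(6)/2 ≈ 1.292: zero-error vs. overlap-tolerant rate at n = 2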
Definition 17. 
{ δ n } -distinguishable rate.
For any sequence { δ n } , a constant rate R is said to be { δ n } -distinguishable if for all n Z > 0 , we have
R ≤ R_δₙ.    (126)
We now give our first definition of capacity for stationary, memoryless, uncertain channels as the supremum of the { δ n } -distinguishable rates. Using this definition, transmitting at a constant rate below capacity ensures the existence of a sequence of codes that, for all n Z > 0 , have confidence of at least 1 δ n .
Definition 18. 
( N , { δ n } ) capacity.
For any stationary, memoryless, uncertain channel with transition mapping N, and any given sequence { δ n } , we let
C_N({δₙ}) = sup{ R : R is {δₙ}-distinguishable }    (127)
= inf_{n ∈ ℤ_{>0}} R_δₙ bits per symbol.    (128)
Another definition of capacity arises if, rather than the largest lower bound to the sequence of rates, one considers the least upper bound for which we can transmit, satisfying a given confidence sequence. Using this definition, transmitting at a constant rate below capacity ensures the existence of a finite-length code (rather than a sequence of codes) that satisfies at least one confidence value along the sequence { δ n } .
Definition 19. 
( N , { δ n } ) capacity.
For any stationary, memoryless, uncertain channel with transition mapping N, and any given sequence { δ n } , we define
C N ( { δ n } ) = sup n Z > 0 R δ n b i t s   p e r   s y m b o l .
Next, consider Definition 19 in the case where {δₙ} is a constant sequence, namely, for all n ∈ ℤ_{>0}, we have δₙ = δ ≥ 0. In this case, transmitting below capacity ensures the existence of a finite-length code that has confidence of at least 1 − δ. This is a generalization of the zero-error capacity.
Definition 20. 
( N , δ ) capacity.
For any stationary, memoryless, uncertain channel with transition mapping N and any sequence { δ n } , where for all n ∈ Z > 0 we have δ n = δ ≥ 0 , we define
C N δ = sup n ∈ Z > 0 R δ n bits per symbol.
Letting δ = 0 , we obtain the zero-error capacity. In this case, below capacity, there exists a code with which we can transmit with full confidence.
Finally, to give a definition of a non-stochastic analog of Shannon’s probabilistic capacity, we first say that any constant rate R is achievable if there exists a sequence δ n → 0 as n → ∞ such that R lies below lim sup n → ∞ R δ n . An achievable rate R then ensures that for all ϵ > 0 , there exists an infinite sequence of distinguishable codes of rate at least R − ϵ whose confidence tends towards one as n → ∞ . It follows that in this case, we can achieve communication at rate R with arbitrarily high confidence by choosing a sufficiently large codebook.
Definition 21. 
Achievable rate.
A constant rate R is achievable if there exists a sequence { δ n } such that δ n → 0 as n → ∞ and
R ≤ lim sup n → ∞ R δ n .
We now introduce the non-stochastic analog of Shannon’s probabilistic capacity as the supremum of the achievable rates. This means that we can pick any confidence sequence such that δ n tends towards zero as n → ∞ . In this way, δ n plays the role of the probability of error, and the capacity is the largest rate that can be achieved by a sequence of codebooks with an arbitrarily high confidence level. Using this definition, transmitting at a rate below capacity ensures the existence of a sequence of codes achieving arbitrarily high confidence by increasing the codeword size.
Definition 22. 
( N , { 0 } ) capacity.
For any stationary, memoryless, uncertain channel with transition mapping N, we define the ( N , { 0 } ) capacity as
C N ( { 0 } ) = sup { R : R is achievable }
= sup { δ n } : δ n = o ( 1 ) lim sup n → ∞ R δ n .
We point out the key difference between Definitions 20 and 22. Transmitting below the ( N , δ ) capacity ensures the existence of a fixed codebook that has confidence of at least 1 − δ . In contrast, transmitting below the ( N , { 0 } ) capacity allows us to achieve arbitrarily high confidence by increasing the codeword size.
To give a visual illustration of the different definitions of capacity, we refer to Figure 8.
For a given sequence { δ n } , the figure sketches the largest { δ n } -distinguishable rate sequence R δ n . According to Definitions 18 and 19, the corresponding capacities are given by the infimum and supremum of this sequence, respectively. On the other hand, according to Definition 22, the capacity C N ( { 0 } ) is the largest limsup over all vanishing sequences { δ n } . Assuming the figure refers to a vanishing sequence { δ n } that achieves the supremum in (133), we have that the capacity of Definition 18 is at most C N ( { 0 } ) , which in turn is at most the capacity of Definition 19.
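To make the comparison concrete, the following toy computation uses a purely hypothetical rate sequence (not derived from any actual channel) to illustrate how the capacity of Definition 18 is the infimum of the rate sequence, the capacity of Definition 19 is its supremum, and the ( N , { 0 } ) capacity of Definition 22 is governed by the limiting behavior; the last printed value is only a finite-horizon proxy for the limsup.

```python
# Hypothetical largest-rate sequence R_{delta_n} for n = 1, ..., 10,
# e.g., rates that improve with block length and approach 2 bits/symbol.
R = [2.0 - 1.0 / n for n in range(1, 11)]

cap_def18 = min(R)   # inf over n of R_{delta_n} (Definition 18)
cap_def19 = max(R)   # sup over n of R_{delta_n} (Definition 19)
cap_tail  = R[-1]    # finite-horizon proxy for lim sup_n R_{delta_n} (Definition 22)

print(cap_def18, cap_def19, cap_tail)   # 1.0 1.9 1.9
```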
We now relate our notions of capacity to the mutual information rate between transmitted and received codewords. Let X be the UV corresponding to the transmitted codeword, namely a map X : X → 𝒳 whose marginal range is the codebook 𝒳 ⊆ X . Restricting this map to a finite time n ∈ Z > 0 yields another UV X ( n ) with marginal range 𝒳 ( n ) ⊆ X . Likewise, a codebook segment is a UV X ( a : b ) = { X ( n ) } a ≤ n ≤ b of marginal range 𝒳 ( a : b ) ⊆ X b − a + 1 . Similarly, let Y be the UV corresponding to the received codeword, namely a map Y : Y → 𝒴 whose marginal range is 𝒴 ⊆ Y ; Y ( n ) and Y ( a : b ) are UVs with marginal ranges 𝒴 ⊆ Y and 𝒴 ( a : b ) ⊆ Y b − a + 1 , respectively. For a stationary, memoryless, uncertain channel with transition mapping N, these UVs are such that for all n ∈ Z > 0 , y ( 1 : n ) ∈ Y ( 1 : n ) , and x ( 1 : n ) ∈ X ( 1 : n ) , we have
Y ( 1 : n ) | x ( 1 : n ) = { y ( 1 : n ) ∈ Y ( 1 : n ) : y ( 1 : n ) ∈ S N ( x ( 1 : n ) ) } ,
X ( 1 : n ) | y ( 1 : n ) = { x ( 1 : n ) ∈ X ( 1 : n ) : y ( 1 : n ) ∈ S N ( x ( 1 : n ) ) } .
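A minimal sketch of the conditional ranges in (135) and (136): assuming, for illustration, that the per-codeword uncertainty set of a stationary, memoryless channel factorizes as the product of hypothetical per-symbol noise sets, the conditional range of the received codeword given a transmitted one is its uncertainty set, and the conditional range of the transmitted codeword given a received one collects all codewords whose uncertainty sets contain it.

```python
from itertools import product

N = {0: {0, 1}, 1: {1, 2}, 2: {3}}              # hypothetical per-symbol noise sets
codebook = [(0, 0), (1, 2), (2, 2)]             # marginal range of X(1:2)

def S_N(cw):
    # memoryless product structure of the uncertainty set
    return set(product(*(N[s] for s in cw)))

# Marginal range of Y(1:2): every output reachable from some codeword.
Y_range = set().union(*(S_N(cw) for cw in codebook))

def Y_given_x(x):
    # Equation (135): outputs consistent with the transmitted codeword x.
    return {y for y in Y_range if y in S_N(x)}

def X_given_y(y):
    # Equation (136): codewords whose uncertainty set contains the received y.
    return {x for x in codebook if y in S_N(x)}

print(Y_given_x((1, 2)))     # the set {(1, 3), (2, 3)}
print(X_given_y((1, 3)))     # the set {(1, 2)}
```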
Now, we define the largest δ n -mutual information rate as the supremum of the mutual information per transmitted symbol that a codeword X ( 1 : n ) can provide about Y ( 1 : n ) with confidence of at least 1 − δ n / | X ( 1 : n ) | .
Definition 23. 
Largest δ n -information rate.
For all n ∈ Z > 0 , the largest δ n -information rate from X ( 1 : n ) to Y ( 1 : n ) is
R δ n I = sup X ( 1 : n ) : X ( 1 : n ) ⊆ X n , δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) / n .
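In the special case δ ˜ = 0 and under a counting measure, the overlap family entering the mutual information groups together conditional ranges that chain through nonempty intersections, so for this illustration I 0 ( Y ( 1 : n ) ; X ( 1 : n ) ) can be computed as the base-2 logarithm of the number of connected components of the confusability graph of the codebook. The sketch below does this for the toy channel used earlier; it covers only the zero-error special case and rests on the stated assumptions.

```python
from itertools import product
from math import log2

N = {0: {0, 1}, 1: {1, 2}, 2: {3}}              # hypothetical per-symbol noise sets
codebook = [(0, 0), (0, 1), (2, 2)]

def S_N(cw):
    return set(product(*(N[s] for s in cw)))

# Confusability graph: codewords are adjacent when their uncertainty sets overlap.
adj = {x: [z for z in codebook if z != x and S_N(x) & S_N(z)] for x in codebook}

def components(nodes, adj):
    """Count connected components by depth-first search."""
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
    return count

k = components(codebook, adj)
print("I_0 =", log2(k), "bits")   # number of zero-error-distinguishable groups, in bits
```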
In the following theorem, we establish the relationship between R δ n and R δ n I .
Theorem 9. 
For any totally bounded, normed metric space X , discrete-time space X , stationary, memoryless, uncertain channel with transition mapping N satisfying (135) and (136), and sequence { δ n } such that, for all n ∈ Z > 0 , we have 0 ≤ δ n < m Y ( V N n ) , we have
R δ n = sup X ( 1 : n ) ∈ F δ ˜ ( n ) , δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) / n .
We also have
R δ n = R δ n I .
Proof. 
The proof of the theorem is similar to the one of Theorem 4 and is given in Appendix B. □
The following coding theorem is now an immediate consequence of Theorem 9 and of our capacity definitions.
Theorem 10. 
For any totally bounded, normed metric space X , discrete-time space X , stationary, memoryless, uncertain channel with transition mapping N satisfying (135) and (136), and sequence { δ n } such that for all n ∈ Z > 0 , 0 ≤ δ n < m Y ( V N n ) and 0 ≤ δ < m Y ( V N n ) , we have
( 1 ) C N ( { δ n } ) = inf n ∈ Z > 0 R δ n I ,
( 2 ) C N ( { δ n } ) = sup n ∈ Z > 0 R δ n I ,
( 3 ) C N ( { 0 } ) = sup { δ n } : δ n = o ( 1 ) lim sup n → ∞ R δ n I ,
( 4 ) C N δ = sup n ∈ Z > 0 R δ n I , where δ n = δ for all n ∈ Z > 0 .
Theorem 10 provides multi-letter expressions of capacity, since R δ n I depends on I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) according to (137). In the Supplementary Materials, we establish some special cases of uncertainty functions, confidence sequences, and classes of stationary, memoryless, uncertain channels, leading to the factorization of the mutual information and to single-letter expressions.

7. Conclusions and Future Directions

We presented a non-stochastic notion of information with worst-case confidence and related it to the capacity of a communication channel subject to unknown noise. Using the non-stochastic uncertain-variables framework of Nair [5] and a generalization of the Kolmogorov capacity allowing some amount of overlap in the packing sets [7], we showed that the capacity equals the largest amount of information conveyed by the transmitter to the receiver with a given level of confidence. These results are a natural generalization of Nair’s results, obtained in a zero-error framework, and they provide an information-theoretic interpretation of the geometric problem of sphere packing with overlap, as studied in [7].
Non-stochastic approaches to information and their use to quantify the performance of various engineering systems have recently received attention in the context of estimation, control, security, communication over non-linear optical channels, and learning systems [8,9,10,11,12,13,14]. We hope that the theory developed here can be useful in the future in some of these contexts. While refinements and extensions of the theory are certainly of interest, explorations of application domains are of paramount importance. There is evidence in the literature regarding the need for a non-stochastic approach to study the flow of information in complex systems, and there is a certain tradition in computer science, especially in the field of online learning, of studying various problems in both a stochastic and a non-stochastic setting [15,16,17]. Nevertheless, it seems that only a few isolated efforts have been made towards the formal development of a non-stochastic information theory. Wider involvement of the community in developing alternative, even competing, theories is certainly advisable to eventually fulfill the needs of these application areas.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/e27050472/s1, Kolmogorov capacity with overlap.

Author Contributions

Conceptualization, M.F. and A.R.; methodology, M.F. and A.R.; investigation, M.F. and A.R.; writing—original draft preparation, A.R.; writing—review and editing, M.F. and A.R.; supervision, M.F.; project administration, M.F.; funding acquisition, M.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by NSF Award Number: 2127605.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

This article is a revised and expanded version of two conference papers: “Towards a Non-Stochastic Information Theory”, presented at the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019, and “Channel Coding Theorems in Non-Stochastic Information Theory”, presented at the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, VIC, Australia, 12–20 July 2021 [18,19].

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Proof of Lemma 1

Proof. 
Let ( X , Y ) a ( δ 1 , δ 2 ) . Then,
A ( X ; Y ) ≤ δ 1 ,
A ( Y ; X ) ≤ δ 2 .
Let
S 1 = { ( y 1 , y 2 ) : m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) = 0 } .
Then, for all ( y 1 , y 2 ) ∈ S 1 , we have
m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) = 0 ≤ δ 1 .
Also, if ( y 1 , y 2 ) ∈ S 1 , then
m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) ∉ A ( X ; Y ) ,
and if ( y 1 , y 2 ) ∉ S 1 , then using (9), we have
m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) ∈ A ( X ; Y ) .
This along with (A1) and (A4) implies that (17) follows.
Likewise, let
S 2 = { ( x 1 , x 2 ) : m Y ( Y | x 1 ∩ Y | x 2 ) / m Y ( Y ) = 0 } .
Then, for all ( x 1 , x 2 ) ∈ S 2 ,
m Y ( Y | x 1 ∩ Y | x 2 ) / m Y ( Y ) = 0 ≤ δ 2 .
Also, if ( x 1 , x 2 ) ∈ S 2 , then
m Y ( Y | x 1 ∩ Y | x 2 ) / m Y ( Y ) ∉ A ( Y ; X ) ,
and if ( x 1 , x 2 ) ∉ S 2 , then using (9), we have
m Y ( Y | x 1 ∩ Y | x 2 ) / m Y ( Y ) ∈ A ( Y ; X ) .
This along with (A2) and (A8) implies that (18) follows.
Now, we prove the opposite direction of the statement. Given that for all y 1 , y 2 ∈ Y , we have
m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) ≤ δ 1 ,
and for all x 1 , x 2 ∈ X , we have
m Y ( Y | x 1 ∩ Y | x 2 ) / m Y ( Y ) ≤ δ 2 .
Then, using the definition of A ( X ; Y ) and A ( Y ; X ) , we have
A ( X ; Y ) ≤ δ 1 ,
A ( Y ; X ) ≤ δ 2 .
The statement of the lemma follows. □

Appendix B. Proof of Theorem 9

Proof. 
We will show (138). Then, using Lemma A4 in Appendix C, (139) follows using the same argument as in the proof of Theorem 8.
We proceed in three steps. First, we show that for all n > 0 , there exists a UV X ( 1 : n ) and δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) such that X ( 1 : n ) ∈ F δ ˜ ( n ) , which implies that F δ ˜ ( n ) is not empty, and so the supremum is well defined. Second, for all n > 0 , X ( 1 : n ) , and δ ˜ such that
X ( 1 : n ) ∈ F δ ˜ ( n ) ,
and
δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) ,
we show that
I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) / n ≤ R δ n .
Finally, for all n > 0 , we show the existence of X ( 1 : n ) ∈ F δ ˜ ( n ) and δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) such that
I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) / n = R δ n .
Let us begin with the first step. Consider a point x ( 1 : n ) ∈ X n . Let X ( 1 : n ) be a UV such that
X ( 1 : n ) = { x ( 1 : n ) } .
Then, it holds that the marginal range of the UV Y ( 1 : n ) corresponding to the received variable is
Y ( 1 : n ) = Y ( 1 : n ) | x ( 1 : n ) ,
and therefore, for all y ( 1 : n ) ∈ Y ( 1 : n ) , we have
X ( 1 : n ) | y ( 1 : n ) = { x ( 1 : n ) } .
Using Definition 2 and (A18), we have
A ( Y ( 1 : n ) ; X ( 1 : n ) ) = ∅ ,
because X ( 1 : n ) consists of a single point, and therefore, the set in (12) is empty.
On the other hand, using Definition 2 and (A20), we have
A ( X ( 1 : n ) ; Y ( 1 : n ) ) = { 1 } if there exist distinct y 1 ( 1 : n ) , y 2 ( 1 : n ) ∈ Y ( 1 : n ) , and ∅ otherwise.
Using (A21), and since A ≤ δ holds for A = ∅ , we have
A ( Y ( 1 : n ) ; X ( 1 : n ) ) ≤ δ n / ( | X ( 1 : n ) | m Y ( Y ( 1 : n ) ) ) .
Similarly, using (A22), we have
A ( X ( 1 : n ) ; Y ( 1 : n ) ) ≤ 1 .
Now, combining (A23) and (A24), we have
( X ( 1 : n ) , Y ( 1 : n ) ) a ( 1 , δ n / ( | X ( 1 : n ) | m Y ( Y ( 1 : n ) ) ) ) .
Letting δ ˜ = δ n / m Y ( Y ( 1 : n ) ) , this implies that X ( 1 : n ) ∈ F δ ˜ ( n ) , which completes the first step of the proof.
To prove the second step, we define
G ( n ) = { X ( 1 : n ) : X ( 1 : n ) ⊆ X n , ∃ δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) such that ∀ S 1 , S 2 ∈ Y ( 1 : n ) | X ( 1 : n ) , m Y ( S 1 ∩ S 2 ) / m Y ( Y ( 1 : n ) ) ≤ δ ˜ / | X ( 1 : n ) | } ,
which is a larger set than the one containing all UVs X ( 1 : n ) that are ( 1 , δ ˜ / | X ( 1 : n ) | ) -associated with Y ( 1 : n ) . Similarly to (79), it can be shown that
X ( 1 : n ) ∈ G ( n ) ⟹ 𝒳 ( 1 : n ) ∈ X N δ n ( n ) .
Consider now a pair X ( 1 : n ) and δ ˜ such that δ ˜ ≤ δ n / m Y ( Y ( 1 : n ) ) and
X ( 1 : n ) ∈ F δ ˜ ( n ) .
If ( X ( 1 : n ) , Y ( 1 : n ) ) d ( 0 , δ ˜ / | X ( 1 : n ) | ) , then, using Lemma A1 in Appendix C, there exist UVs X ¯ ( 1 : n ) and Y ¯ ( 1 : n ) and δ ¯ ≤ δ n / m Y ( Y ¯ ( 1 : n ) ) such that
( X ¯ ( 1 : n ) , Y ¯ ( 1 : n ) ) a ( 1 , δ ¯ / | X ¯ ( 1 : n ) | ) ,
and
| Y ( 1 : n ) | X ( 1 : n ) δ ˜ / | X ( 1 : n ) | | = | Y ¯ ( 1 : n ) | X ¯ ( 1 : n ) δ ¯ / | X ¯ ( 1 : n ) | | .
On the other hand, if ( X ( 1 : n ) , Y ( 1 : n ) ) a ( 1 , δ ˜ / | X ( 1 : n ) | ) , then (A29) and (A30) also trivially hold. It then follows that (A29) and (A30) hold for all X ( 1 : n ) ∈ F δ ˜ ( n ) . We now have
I δ ˜ / | X ( 1 : n ) | ( Y ( 1 : n ) ; X ( 1 : n ) ) = log ( | Y ( 1 : n ) | X ( 1 : n ) δ ˜ / | X ( 1 : n ) | | ) = ( a ) log ( | Y ¯ ( 1 : n ) | X ¯ ( 1 : n ) δ ¯ / | X ¯ ( 1 : n ) | | ) ≤ ( b ) log ( | X ¯ ( 1 : n ) | ) = ( c ) log ( | 𝒳 ¯ ( 1 : n ) | ) ≤ ( d ) n R δ n ,
where ( a ) follows from (A29) and (A30), ( b ) follows from Lemma A3 in Appendix C since δ ¯ ≤ δ n / m Y ( Y ¯ ( 1 : n ) ) < m Y ( V N n ) / m Y ( Y ¯ ( 1 : n ) ) , ( c ) follows by defining the codebook 𝒳 ¯ ( 1 : n ) corresponding to the UV X ¯ ( 1 : n ) , and ( d ) follows from the fact that, using (A29) and Lemma 1, we have X ¯ ( 1 : n ) ∈ G ( n ) , which implies by (A27) that 𝒳 ¯ ( 1 : n ) ∈ X N δ n ( n ) .
For any n ∈ Z > 0 , let
𝒳 n = arg sup 𝒳 n ∈ X N δ n ( n ) log ( | 𝒳 n | ) / n ,
which achieves the rate R δ n . Let X be the UV whose marginal range corresponds to the codebook 𝒳 n . It follows that for all S 1 , S 1 ′ ∈ Y | X , we have
m Y ( S 1 ∩ S 1 ′ ) / m Y ( Y n ) ≤ δ n / | X | ,
which implies, since m Y ( Y n ) = 1 , that
m Y ( S 1 ∩ S 1 ′ ) / m Y ( Y ) ≤ δ n / ( | X | m Y ( Y ) ) .
Letting δ ˜ = δ n / m Y ( Y ) , and using Lemma 1, it holds that ( X , Y ) a ( 1 , δ ˜ / | X | ) , which implies
X ∈ ⋃ δ ˜ ≤ δ n / m Y ( Y ) F δ ˜ ( n ) ,
and (138) follows. □

Appendix C. Auxiliary Results

Lemma A1. 
Given a δ < m Y ( V N ) , two UVs X and Y satisfying (60) and (61), and a δ ˜ ≤ δ / m Y ( Y ) such that
( X , Y ) d ( 0 , δ ˜ / | X | ) .
Then, there exist two UVs X ¯ and Y ¯ satisfying (60) and (61), and there exists a δ ¯ ≤ δ / m Y ( Y ¯ ) such that
( X ¯ , Y ¯ ) a ( 1 , δ ¯ / | X ¯ | ) ,
and
| Y | X δ ˜ / | X | | = | Y ¯ | X ¯ δ ¯ / | X ¯ | | .
Proof. 
Let the cardinality
| Y | X δ ˜ / | X | | = K .
By Property 1 of Definition 6, it holds that for all S i ∈ Y | X δ ˜ / | X | , there exists an x i ∈ X such that Y | x i ⊆ S i . Now, consider a new UV X ¯ whose marginal range is composed of K elements of X , namely
X ¯ = { x 1 , x 2 , … , x K } .
Let Y ¯ be the UV corresponding to the received variable. Using the fact that for all x ∈ X , we have Y ¯ | x = Y | x since (60) holds, and using Property 2 of Definition 6, for all x , x ′ ∈ X ¯ , we have
m Y ( Y ¯ | x ∩ Y ¯ | x ′ ) / m Y ( Y ) ≤ δ ˜ / | X | ≤ ( a ) δ ˜ / | X ¯ | ,
where ( a ) follows from the fact that X ¯ ⊆ X using (A40). Then, for all x , x ′ ∈ X ¯ , it holds that
m Y ( Y ¯ | x ∩ Y ¯ | x ′ ) / m Y ( Y ¯ ) ≤ δ ˜ m Y ( Y ) / ( | X | m Y ( Y ¯ ) ) ≤ ( a ) δ ¯ / | X ¯ | ,
where δ ¯ = δ ˜ m Y ( Y ) / m Y ( Y ¯ ) . Then, by Lemma 1, it follows that
( X ¯ , Y ¯ ) a ( 1 , δ ¯ / | X ¯ | ) .
Since δ ˜ ≤ δ / m Y ( Y ) , we have
δ ¯ ≤ δ / m Y ( Y ¯ ) < m Y ( V N ) / m Y ( Y ¯ ) .
Using (A43) and (A44), it now follows that
| Y ¯ | X ¯ δ ¯ / | X ¯ | | = ( a ) | X ¯ | = ( b ) | Y | X δ ˜ / | X | | ,
where ( a ) follows from Lemma A4 in Appendix C and ( b ) follows from (A39) and (A40). Hence, the statement of the lemma follows. □
Lemma A2. 
Let
( X , Y ) d ( δ , δ 2 ) .
If x is δ-connected to both x 1 and x 2 , then it holds that x 1 and x 2 are δ-connected.
Proof. 
Let { X | y i } i = 1 N be the sequence of conditional ranges connecting x and x 1 . Likewise, let { X | y ˜ i } i = 1 N ˜ be the sequence of conditional ranges connecting x and x 2 .
Now, by Definition 5, we have
x 1 ∈ X | y N ,
x 2 ∈ X | y ˜ N ˜ ,
x ∈ X | y 1 ,
and
x ∈ X | y ˜ 1 .
Then, using (9), it holds that
m X ( X | y 1 ∩ X | y ˜ 1 ) / m X ( X ) > 0 ,
which implies that
m X ( X | y 1 ∩ X | y ˜ 1 ) / m X ( X ) ∈ A ( X ; Y ) .
Using the fact that
( X , Y ) d ( δ , δ 2 ) ,
we will now show that
{ X | y N , X | y N − 1 , … , X | y 1 , X | y ˜ 1 , … , X | y ˜ N ˜ } ,
is a sequence of conditional ranges connecting x 1 and x 2 . Using (A52) and (A53), it holds that
m X ( X | y 1 ∩ X | y ˜ 1 ) / m X ( X ) > δ .
Also, for all 1 < i ≤ N and 1 < j ≤ N ˜ , we have
m X ( X | y i ∩ X | y i − 1 ) / m X ( X ) > δ ,
and
m X ( X | y ˜ j ∩ X | y ˜ j − 1 ) / m X ( X ) > δ .
Also, we have
x 1 ∈ X | y N , and x 2 ∈ X | y ˜ N ˜ .
Hence, combining (A55), (A56), (A57) and (A58), it holds that x 1 and x 2 are δ-connected via the sequence (A54). □
Lemma A3. 
Consider two UVs X and Y. Let
δ = min y ∈ Y m X ( X | y ) / m X ( X ) .
If δ 1 < δ , then we have
| X | Y δ 1 | ≤ | Y | .
Proof. 
We will prove this by contradiction. Let
| X | Y δ 1 | > | Y | .
Then, by Property 1 of Definition 6, there exist two sets S 1 , S 2 ∈ X | Y δ 1 and one singly δ 1 -connected set X | y such that
X | y ⊆ S 1 , and X | y ⊆ S 2 .
Then, we have
m X ( S 1 ∩ S 2 ) / m X ( X ) ≥ ( a ) m X ( X | y ) / m X ( X ) ≥ ( b ) δ > ( c ) δ 1 ,
where ( a ) follows from (A62) and (10), ( b ) follows from (A59), and ( c ) follows from the fact that δ 1 < δ . However, by Property 2 of Definition 6, we have
m X ( S 1 ∩ S 2 ) / m X ( X ) ≤ δ 1 .
Hence, (A63) and (A64) contradict each other, which implies that (A61) does not hold. Hence, the statement of the lemma follows. □
Lemma A4. 
Consider two UVs X and Y. Let
δ = min y ∈ Y m X ( X | y ) / m X ( X ) .
For all δ 1 < δ and δ 2 ≤ 1 , if ( X , Y ) a ( δ 1 , δ 2 ) , then we have
| X | Y δ 1 | = | Y | .
Additionally, X | Y is a δ 1 -overlap family.
Proof. 
We show that
X | Y = { X | y : y ∈ Y }
is a δ 1 -overlap family. First, note that X | Y is a cover of X , since X = ⋃ y ∈ Y X | y . Second, each set in the family X | Y is singly δ 1 -connected via X | Y , since trivially any two points x 1 , x 2 ∈ X | y are singly δ 1 -connected via the same set. It follows that Property 1 of Definition 6 holds.
Now, since ( X , Y ) a ( δ 1 , δ 2 ) , by Lemma 1, for all y 1 , y 2 ∈ Y we have
m X ( X | y 1 ∩ X | y 2 ) / m X ( X ) ≤ δ 1 ,
which shows that Property 2 of Definition 6 holds. Finally, it is also easy to see that Property 3 of Definition 6 holds, since X | Y contains all sets X | y . Hence, X | Y satisfies all the properties of a δ 1 -overlap family, which implies that
| X | Y | ≤ | X | Y δ 1 | .
Since | X | Y | = | Y | , using Lemma A3, we also have
| X | Y | ≥ | X | Y δ 1 | .
Combining (A69), (A70), and the fact that X | Y satisfies all the properties of a δ 1 -overlap family, the statement of the lemma follows. □

Appendix C.1. Taxicab Symmetry of the Mutual Information

Definition A1. 
( δ 1 , δ 2 ) -taxicab connectedness and ( δ 1 , δ 2 ) -taxicab isolation.
  • Points ( x , y ) , ( x ′ , y ′ ) ∈ X , Y are ( δ 1 , δ 2 ) -taxicab connected via X , Y and are denoted by ( x , y ) δ 1 , δ 2 ( x ′ , y ′ ) , if there exists a finite sequence { ( x i , y i ) } i = 1 N of points in X , Y such that ( x , y ) = ( x 1 , y 1 ) , ( x ′ , y ′ ) = ( x N , y N ) , and for all 2 ≤ i ≤ N , we have either
    A 1 = { x i = x i − 1 and m X ( X | y i ∩ X | y i − 1 ) / m X ( X ) > δ 1 } ,
    or
    A 2 = { y i = y i − 1 and m Y ( Y | x i ∩ Y | x i − 1 ) / m Y ( Y ) > δ 2 } .
    If ( x , y ) δ 1 , δ 2 ( x ′ , y ′ ) and N = 2 , then we say that ( x , y ) and ( x ′ , y ′ ) are singly ( δ 1 , δ 2 ) -taxicab connected, i.e., either y = y ′ and x , x ′ ∈ X | y or x = x ′ and y , y ′ ∈ Y | x .
  • A set S ⊆ X , Y is (singly) ( δ 1 , δ 2 ) -taxicab connected via X , Y if every pair of points in the set is (singly) ( δ 1 , δ 2 ) -taxicab connected in X , Y .
  • Two sets S 1 , S 2 ⊆ X , Y are ( δ 1 , δ 2 ) -taxicab isolated via X , Y if no point in S 1 is ( δ 1 , δ 2 ) -taxicab connected to any point in S 2 .
Definition A2. 
Projection of a set
  • The projection S x + of a set S ⊆ X , Y on the x-axis is defined as
    S x + = { x : ( x , y ) ∈ S } .
  • The projection S y + of a set S ⊆ X , Y on the y-axis is defined as
    S y + = { y : ( x , y ) ∈ S } .
Definition A3. 
( δ 1 , δ 2 ) -taxicab family
A ( δ 1 , δ 2 ) -taxicab family of X , Y , denoted by X , Y ( δ 1 , δ 2 ) , is a largest family of distinct sets covering X , Y such that
1. 
Each set in the family is ( δ 1 , δ 2 ) -taxicab connected and contains at least one singly δ 1 -connected set of the form X | y × { y } and at least one singly δ 2 -connected set of the form Y | x × { x } .
2. 
The measures of overlap between the projections on the x-axis and on the y-axis of any two distinct sets in the family are at most δ 1 m X ( X ) and δ 2 m Y ( Y ) , respectively.
3. 
For every singly ( δ 1 , δ 2 ) -connected set, there exists a set in the family containing it.
We now show that when ( X , Y ) d ( δ 1 , δ 2 ) holds, the cardinality of the ( δ 1 , δ 2 ) -taxicab family is the same as the cardinality of the X | Y δ 1 -overlap family and of the Y | X δ 2 -overlap family.
Proof of Theorem 3. 
We will show that | X , Y ( δ 1 , δ 2 ) | = | X | Y δ 1 | . Then, | X , Y ( δ 1 , δ 2 ) | = | Y | X δ 2 | can be derived along the same lines. Hence, the statement of the theorem follows.
First, we will show that
D = { S x + : S ∈ X , Y ( δ 1 , δ 2 ) } ,
satisfies all the properties of X | Y δ 1 .
Since X , Y ( δ 1 , δ 2 ) is a covering of X , Y , we have
⋃ S x + ∈ D S x + = X ,
which implies that D is a covering of X .
Consider a set S ∈ X , Y ( δ 1 , δ 2 ) . For all ( x , y ) , ( x ′ , y ′ ) ∈ S , ( x , y ) and ( x ′ , y ′ ) are ( δ 1 , δ 2 ) -taxicab connected. Then, there exists a taxicab sequence of the form
( x , y ) , ( x 1 , y ) , ( x 1 , y 1 ) , … , ( x n − 1 , y ′ ) , ( x ′ , y ′ ) ,
such that either A 1 or A 2 in Definition A1 is true. Then, the sequence { y , y 1 , … , y n − 1 , y ′ } yields a sequence of conditional ranges { X | y ˜ j } j = 1 n + 1 such that for all 1 < j ≤ n + 1 ,
m X ( X | y ˜ j ∩ X | y ˜ j − 1 ) / m X ( X ) > δ 1 ,
x ∈ X | y ˜ 1 , and x ′ ∈ X | y ˜ n + 1 .
Hence, x and x ′ are δ 1 -connected via X | Y , and therefore S x + is δ 1 -connected via X | Y . Also, S contains at least one singly δ 1 -connected set of the form X | y × { y } , which implies X | y ⊆ S x + . Hence, S x + contains at least one singly δ 1 -connected set of the form X | y . Hence, D satisfies Property 1 in Definition 6.
For all S 1 , S 2 ∈ X , Y ( δ 1 , δ 2 ) , we have
m X ( S 1 , x + ∩ S 2 , x + ) ≤ δ 1 m X ( X ) ,
using Property 2 in Definition A3. Hence, D satisfies Property 2 in Definition 6.
Using Property 3 in Definition A3, it holds that for all X | y × { y } , there exists a set S ( y ) ∈ X , Y ( δ 1 , δ 2 ) containing it. This implies that for all X | y ∈ X | Y , we have
X | y ⊆ S ( y ) x + .
Hence, D satisfies Property 3 in Definition 6.
Thus, D satisfies all three properties of X | Y δ 1 . This implies, along with Theorem 2, that
| D | = | X | Y δ 1 | ,
which implies that
| X , Y ( δ 1 , δ 2 ) | = | X | Y δ 1 | .
Hence, the statement of the theorem follows. □

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  2. Nair, G.N. A nonstochastic information theory for communication and state estimation. IEEE Trans. Autom. Control 2013, 58, 1497–1510.
  3. Rosenfeld, M. On a problem of C.E. Shannon in graph theory. Proc. Am. Math. Soc. 1967, 18, 315–319.
  4. Shannon, C. The zero error capacity of a noisy channel. IRE Trans. Inf. Theory 1956, 2, 8–19.
  5. Nair, G.N. A nonstochastic information theory for feedback. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), Maui, HI, USA, 10–13 December 2012; pp. 1343–1348.
  6. Tikhomirov, V.M.; Kolmogorov, A.N. ϵ-entropy and ϵ-capacity of sets in functional spaces. Uspekhi Mat. Nauk 1959, 14, 3–86.
  7. Lim, T.J.; Franceschetti, M. Information without rolling dice. IEEE Trans. Inf. Theory 2017, 63, 1349–1363.
  8. Borujeny, R.R.; Kschischang, F.R. A signal-space distance measure for nondispersive optical fiber. IEEE Trans. Inf. Theory 2021, 67, 5903–5921.
  9. Ferng, C.S.; Lin, H.T. Multi-label classification with error-correcting codes. In Proceedings of the Asian Conference on Machine Learning, Taoyuan, Taiwan, 13–15 November 2011; pp. 281–295.
  10. Saberi, A.; Farokhi, F.; Nair, G. Estimation and control over a nonstochastic binary erasure channel. IFAC-PapersOnLine 2018, 51, 265–270.
  11. Saberi, A.; Farokhi, F.; Nair, G.N. State estimation via worst-case erasure and symmetric channels with memory. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 3072–3076.
  12. Verma, G.; Swami, A. Error correcting output codes improve probability estimation and adversarial robustness of deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8646–8656.
  13. Weng, T.W.; Zhang, H.; Chen, P.Y.; Yi, J.; Su, D.; Gao, Y.; Hsieh, C.J.; Daniel, L. Evaluating the robustness of neural networks: An extreme value theory approach. arXiv 2018, arXiv:1801.10578.
  14. Wiese, M.; Johansson, K.H.; Oechtering, T.J.; Papadimitratos, P.; Sandberg, H.; Skoglund, M. Uncertain wiretap channels and secure estimation. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 2004–2008.
  15. Agrawal, R. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 1995, 27, 1054–1078.
  16. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 2002, 32, 48–77.
  17. Rangi, A.; Franceschetti, M. Online learning with feedback graphs and switching costs. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 2435–2444.
  18. Rangi, A.; Franceschetti, M. Towards a non-stochastic information theory. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 997–1001.
  19. Rangi, A.; Franceschetti, M. Channel coding theorems in non-stochastic information theory. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021; pp. 2295–2300.
Figure 1. Illustration of disassociation between UVs. Case (a): variables are maximally disassociated, and all conditional ranges completely overlap, in that all conditional ranges are equal to X (or Y ). Case (b): variables are disassociated at some levels ( δ 1 , δ 2 ) , and there is some overlap between at least two conditional ranges. Case (c): variables are not disassociated at any levels, and there is no overlap between the conditional ranges.
Figure 2. Illustration of the possible time intervals for the walkers on the path.
Figure 3. The size of the equivocation set is inversely proportional to the amount of adversarial effort required to induce an error.
Figure 4. Illustration of the ( ϵ , δ ) -capacity in terms of packing ϵ -balls with maximum overlap δ .
Figure 5. Conditional ranges Y | x and X | y due to the ϵ -perturbation channel.
Figure 6. Output configuration for the computation of C ϵ δ and C ˜ ϵ δ .
Figure 7. Uncertainty sets associated with three different codewords. Sets are not necessarily balls; they can be different across codewords and can also be composed of disconnected subsets.
Figure 8. Illustration of capacities: This figure plots the sequence R δ n for a given sequence of δ n with respect to n > 0 .
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
