Article

Analysis on Optimal Error Exponents of Binary Classification for Source with Multiple Subclasses

Hiroto Kuramata and Hideki Yagi *
Department of Computer and Network Engineering, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu 182-8585, Tokyo, Japan
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2022, 24(5), 635; https://doi.org/10.3390/e24050635
Submission received: 31 March 2022 / Revised: 25 April 2022 / Accepted: 27 April 2022 / Published: 30 April 2022

Abstract

We consider a binary classification problem in which we determine from which source a test sequence is generated. The system classifies the test sequence based on empirically observed (training) sequences obtained from unknown sources $P_1$ and $P_2$. We analyze the asymptotic fundamental limits of statistical classification for sources with multiple subclasses. We investigate the first- and second-order maximum error exponents under the constraint that the type-I error probability for all pairs of distributions decays exponentially fast and the type-II error probability is upper bounded by a small constant. In this paper, we first give a classifier which achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses, and then provide a characterization of the first-order error exponent. We next provide a characterization of the second-order error exponent in the case where only $P_2$ has multiple subclasses but $P_1$ does not. We generalize our results to classification in the case where $P_1$ and $P_2$ are a stationary and memoryless source and a mixed memoryless source with general mixture, respectively.

1. Introduction

1.1. Background

The problem of learning sources from training sequences and estimating the source from which a test sequence is generated is known as a classification problem. Recently, this problem has been actively studied in fields such as machine learning, and it is desirable to provide theoretical guarantees on the performance of such systems. In the field of information theory, studies have been conducted mainly to analyze the performance of optimal tests. When the number of sources is two, the binary classification problem can be regarded as a binary hypothesis testing problem that uses training sequences. In the setting of binary hypothesis testing, it is assumed that the sources are known, but in real-world applications the sources are generally not known. Therefore, it is important to consider the binary classification problem.
Hypothesis testing includes approaches such as the Bayesian test [1,2] and the Neyman–Pearson test [3,4,5]. In this paper, we take the latter approach to formulate the best asymptotic error exponent (the exponential part of an error probability).
Many studies are related to the classification problem; here we highlight the previous results most closely connected to this study. Gutman [3] showed that type-based (empirical distribution-based) tests asymptotically achieve the maximum type-II error exponent for stationary Markov sources, while the type-I error probability converges to zero exponentially fast as the length of the test sequence goes to infinity. Zhou et al. [5] derived second-order approximations of the maximum type-I error exponent for stationary and memoryless sources when the type-II error probability is upper bounded by a small constant. On the other hand, for the hypothesis testing problem, Han and Nomura [6] characterized the first-order maximum error exponent when each sequence is generated from a mixed memoryless source, which is a mixture of stationary and memoryless sources. In addition, they also characterized the second-order maximum error exponent in the case where one source is a stationary and memoryless source and the other is a mixed memoryless source.

1.2. Contributions

In this paper, we investigate the binary classification problem for stationary memoryless sources with multiple subclasses. The class of sources with multiple subclasses is important in binary classification because many such settings arise in real-world applications; for example, newspaper articles under a science headline may cover topics in physics, chemistry, biology, and so on. We assume that the sources (subclasses) are characterized by a mixture with some unknown prior distribution (cf. Equation (2)), so that the overall sources can be regarded as mixed memoryless sources [6]. The purpose of this paper is to characterize the first- and second-order maximum error exponents in a single-letter form (the term “single-letter form” means an expression that does not depend on the lengths of the sequences $n$ or $N$; cf. the formulas for the error exponents in Theorems 2–4).
To this end, we generalize Gutman’s classifier [3], which was shown to be first- and second-order optimal for memoryless sources (with no multiple subclasses) in [5]. This classifier uses training sequences from only one of the two sources, as in [3,4,5], making a type-based decision based on the source (subclass) with the smallest skewed Jensen–Shannon divergence [7] among the subclasses. In Theorem 1, we show that this classifier asymptotically achieves the maximum type-II error exponent in the class of deterministic classifiers for a given pair of distributions when the type-I error probability decays exponentially fast for all pairs of distributions. In Theorem 2, we also demonstrate that the structure of this classifier leads to a reversed and more relaxed relation: the maximization of the type-I error exponent when the type-II error probability is upper bounded by a small constant $\epsilon$ ($0 \le \epsilon < 1$) for sources with multiple subclasses. In addition, using the Berry–Esseen theorem [8], we derive in Theorem 3 the second-order maximum error exponent in the case where only one of the sources has subclasses. Finally, the fact that the classifier uses training sequences from only one of the two sources motivates us to consider a more general case: the first source is a source with no multiple subclasses, but the second source is given by a general mixture [6,9]. That is, the number of subclasses is not necessarily finite, and the prior distribution of the subclasses may not be discrete (cf. Equation (75)). We give characterizations of the first- and second-order maximum error exponents in Theorem 4.

1.3. Related Work

Ziv [10] proposed a classifier based on empirical entropy and discussed the relationship between binary classification and universal source coding. Hsu and Wang [4] characterized the maximum error exponent with mismatched empirically observed statistics. In their achievability proof, a generalization of Gutman’s classifier is also used. Kelly et al. [11] investigated binary classification with large alphabets. Unnikrishnan and Huang [12] investigated the type-I error probability of binary classification using the analysis of weak convergence. Generalizing the binary classification problem, He et al. [13] discussed the binary distribution detection problem, in which a different generalization of Gutman’s classifier is also discussed.
There are also studies which take a Bayesian approach. Merhav and Ziv [1] analyzed the weighted sum of type-I and type-II error probabilities, and subsequently, Saito and Matsushima [2,14] gave a different result via the analysis for the Bayes code.

1.4. Organization

The rest of this paper is organized as follows. In Section 2, we define the notation used in this paper and describe the details of the source and system models. Moreover, we state the problem setting, defining the first- and second-order maximum error exponents. In Section 3, we first give a classifier that achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses. Next, we characterize the first- and second-order maximum error exponents and give the detailed proofs for the first-order characterization. In Section 4, we generalize the obtained results to the classification of a mixed memoryless source with general mixture. In Section 5, we present numerical examples. Finally, in Section 6, we provide some concluding remarks and future work.

2. Problem Formulation

2.1. Notation

The set of non-negative real numbers is denoted by $\mathbb{R}_+$. Calligraphic $\mathcal{X}$ stands for a finite alphabet. Upper-case $X$ denotes a random variable taking values in $\mathcal{X}$, and lower-case $x \in \mathcal{X}$ denotes its realization. Throughout this paper, logarithms are of base $e$. For integers $a$ and $b$ such that $a \le b$, $[a, b]$ denotes the set $\{a, a+1, \ldots, b\}$. The set of all probability distributions on a finite set $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. Notation regarding the method of types [15] is as follows: given a vector $x_1^n = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$, the type is denoted as
$$ q_{x_1^n}(a) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i = a\}, \quad a \in \mathcal{X}. \qquad (1) $$
The set of types formed from length-$n$ sequences with alphabet $\mathcal{X}$ is denoted as $\mathcal{P}_n(\mathcal{X})$. The probability that $n$ independent drawings from a probability distribution $Q \in \mathcal{P}(\mathcal{X})$ give $\boldsymbol{x} \in \mathcal{X}^n$ is denoted by $Q(\boldsymbol{x})$.
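As a small illustration of this notation (our own sketch, not part of the original text), the following Python snippet computes the type $q_{x_1^n}$ of a sequence over a finite alphabet; the function name `empirical_type` is ours.

```python
from collections import Counter

def empirical_type(seq, alphabet):
    """Empirical distribution (type) of a sequence over a finite alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts.get(a, 0) / n for a in alphabet}

# Example: the type of a binary sequence of length 8.
print(empirical_type([0, 1, 1, 0, 1, 0, 0, 0], alphabet=[0, 1]))
# {0: 0.625, 1: 0.375}
```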

2.2. Source with Multiple Subclasses

Consider a source consisting of multiple subclasses, where each subclass is selected according to a given probability (weight). Let $\{P_1^i\}_{i \in S}$ be a family of probability distributions on a finite alphabet $\mathcal{X}$, where $S = \{1, \ldots, s\}$ is a probability space with probability measure $v(i)$, $i \in S$. That is, the probability of $\boldsymbol{x} \in \mathcal{X}^n$ is given by
$$ P_1(\boldsymbol{x}) = \sum_{i=1}^{s} v(i)\, P_1^i(\boldsymbol{x}), \qquad (2) $$
where the $i$-th subclass $P_1^i$ is a stationary and memoryless source. That is, for $\boldsymbol{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$,
$$ P_1^i(\boldsymbol{x}) = \prod_{j=1}^{n} P_1^i(x_j) \qquad (3) $$
(for notational simplicity, we denote both the multi-letter and single-letter probabilities by $P_1^i$ with a slight abuse of notation).
In view of (2), the sequence $\boldsymbol{x}$ can be regarded as an output from a mixed memoryless source $P_1(\cdot)$, and it is called a test sequence. Similarly, let $\{P_2^i\}_{i \in U}$ be a family of probability distributions, where $U = \{1, \ldots, u\}$ is a probability space with probability measure $w(i)$, $i \in U$. For these mixed sources, if the sources are known, that is, if the addressed problem is hypothesis testing, the first- and second-order error exponents were analyzed by Han and Nomura [6]. In this paper, we assume that the sources are unknown and training sequences are available to learn about the sources. Sets of training sequences are denoted by $\boldsymbol{t}_1 = \{\boldsymbol{t}_1^1, \ldots, \boldsymbol{t}_1^s\}$ and $\boldsymbol{t}_2 = \{\boldsymbol{t}_2^1, \ldots, \boldsymbol{t}_2^u\}$, where $\boldsymbol{t}_i^j \in \mathcal{X}^N$ of length $N$ is output from subclass $j$ and $N = n\gamma$ for some fixed $\gamma \in \mathbb{R}_+$. Then, the joint probabilities of the training sequences are, respectively,
$$ P_1(\boldsymbol{t}_1) = \prod_{i=1}^{s} P_1^i(\boldsymbol{t}_1^i), \qquad (4) $$
$$ P_2(\boldsymbol{t}_2) = \prod_{j=1}^{u} P_2^j(\boldsymbol{t}_2^j). \qquad (5) $$
We define the class of sources with multiple subclasses on a probability space $S$ with probability measure $v(\cdot)$ as
$$ \mathcal{P}_S(\mathcal{X}) := \big\{ P = \{P^i, v(i)\}_{i \in S} : P^i \in \mathcal{P}(\mathcal{X}) \big\}, \qquad (6) $$
which means $P_1 \in \mathcal{P}_S(\mathcal{X})$, where the set of weights $\{v(i)\}_{i \in S}$ is implicitly fixed. Similarly, we define the class of sources with multiple subclasses on a probability space $U$ with probability measure $w(\cdot)$ as
$$ \mathcal{P}_U(\mathcal{X}) := \big\{ P = \{P^i, w(i)\}_{i \in U} : P^i \in \mathcal{P}(\mathcal{X}) \big\}, \qquad (7) $$
which means $P_2 \in \mathcal{P}_U(\mathcal{X})$.
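To make the source model concrete, the following sketch (our own illustration; the function name and the use of NumPy are assumptions, and the Bernoulli parameters are borrowed from the numerical example in Section 5) draws one test sequence according to (2): a subclass index is first drawn from the prior $v$, and the whole sequence is then drawn i.i.d. from that subclass.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_source(subclass_dists, prior, n):
    """Draw one length-n sequence from a source with multiple subclasses:
    pick a subclass i with probability prior[i], then draw the sequence
    i.i.d. from the memoryless subclass distribution subclass_dists[i]."""
    i = rng.choice(len(subclass_dists), p=prior)
    return rng.choice(len(subclass_dists[i]), size=n, p=subclass_dists[i])

# Binary example with three Bernoulli subclasses and a uniform prior
# (parameters taken from the numerical example in Section 5.1).
P1_subclasses = [[0.611, 0.389], [0.678, 0.322], [0.744, 0.256]]  # [P(0), P(1)]
x = sample_mixed_source(P1_subclasses, prior=[1/3, 1/3, 1/3], n=20)
print(x)
```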

2.3. System Model

The binary classification problem assumed in this paper is shown in Figure 1. It consists of two phases: (I) learning phase and (II) classification phase. We explain the details of each phase.
(I) Learning phase: determine the classifier by learning from the training sequences $\boldsymbol{t}_1 = \{\boldsymbol{t}_1^1, \ldots, \boldsymbol{t}_1^s\}$ and $\boldsymbol{t}_2 = \{\boldsymbol{t}_2^1, \ldots, \boldsymbol{t}_2^u\}$ generated from the unknown sources $P_1 \in \mathcal{P}_S(\mathcal{X})$ and $P_2 \in \mathcal{P}_U(\mathcal{X})$, respectively.
(II) Classification phase: judge whether the test sequence $\boldsymbol{x} \in \mathcal{X}^n$ was generated from $P_1 \in \mathcal{P}_S(\mathcal{X})$ or $P_2 \in \mathcal{P}_U(\mathcal{X})$ according to the classifier determined in (I).

2.4. Maximum Error Exponent

In this section, we define the two error probabilities that arise in a binary classification problem and formulate the maximum error exponents. In the binary classification problem, a test is described as a partition of the observation space, specified by a mapping $\phi^n : \mathcal{X}^n \times \mathcal{X}^{(s+u)N} \to \{1, 2\}$. The type-I and type-II error probabilities of a given test $\phi^n$ are denoted as $\beta_1(\phi^n | P_1, P_2)$ and $\beta_2(\phi^n | P_1, P_2)$, respectively. That is,
$$ \beta_1(\phi^n | P_1, P_2) := \mathrm{P}_1\{\phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = 2\}, \qquad (8) $$
$$ \beta_2(\phi^n | P_1, P_2) := \mathrm{P}_2\{\phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = 1\}. \qquad (9) $$
Here, $\mathrm{P}_\theta\{\cdot\}$ is the joint probability of the training and test sequences when the underlying parameter is $\theta$, given by
$$ \mathrm{P}_1\{\phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = \ell\} := \sum_{(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2):\, \phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = \ell} P_1(\boldsymbol{x}) \prod_{i=1}^{s} P_1^i(\boldsymbol{t}_1^i) \prod_{j=1}^{u} P_2^j(\boldsymbol{t}_2^j), \quad \ell \in \{1, 2\}, \qquad (10) $$
$$ \mathrm{P}_2\{\phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = \ell\} := \sum_{(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2):\, \phi^n(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) = \ell} P_2(\boldsymbol{x}) \prod_{i=1}^{s} P_1^i(\boldsymbol{t}_1^i) \prod_{j=1}^{u} P_2^j(\boldsymbol{t}_2^j), \quad \ell \in \{1, 2\}. \qquad (11) $$
We consider the problem of maximizing the type-I error exponent when the type-II error probability is upper bounded by a small constant $\epsilon \in [0, 1)$. In this study, we characterize the following quantities (the first- and second-order maximum type-I error exponents).
Definition 1
(First-order maximum error exponent). For any pair of distributions $P = (P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and $\epsilon \in [0, 1)$, we define
$$ \hat{\lambda}(\epsilon) := \sup\Big\{ \lambda \in \mathbb{R}_+ \,\Big|\, \exists\, \{\phi^n\}_{n=1}^{\infty} \ \text{s.t. for all sufficiently large } n,\ \beta_1(\phi^n | \tilde{P}) \le \exp(-n\lambda)\ (\forall\, \tilde{P} \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})),\ \limsup_{n\to\infty} \beta_2(\phi^n | P) \le \epsilon \Big\}, \qquad (12) $$
where the weights of $\tilde{P} = (\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ are the same as the weights of $P = (P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$.
Definition 2
(Second-order maximum error exponent). For any pair of distributions $P = (P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and $\epsilon \in [0, 1)$, we define
$$ \hat{r}(\epsilon, \lambda) := \sup\Big\{ r \,\Big|\, \exists\, \{\phi^n\}_{n=1}^{\infty} \ \text{s.t. for all sufficiently large } n,\ \beta_1(\phi^n | \tilde{P}) \le \exp(-n\lambda - \sqrt{n}\, r)\ (\forall\, \tilde{P} \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})),\ \limsup_{n\to\infty} \beta_2(\phi^n | P) \le \epsilon \Big\}, \qquad (13) $$
where the weights of $\tilde{P} = (\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ are the same as the weights of $P = (P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$.
Remark 1.
In (12) and (13), the type-I error probability is constrained for any P ˜ P S ( X ) × P U ( X ) for technical reasons. In more detail, this condition is required in the proof of the converse part. This condition was also imposed by Gutman [3], Hsu and Wang [4], and Zhou et al. [5].
In Definitions 1 and 2, we focus on universal tests that perform well for all pairs of distributions with respect to the type-I error probability, and at the same time, constrain the type-II error probability with respect to a particular pair of distributions. We obtain the same result when the weights of ( P ˜ 1 , P ˜ 2 ) P S ( X ) × P U ( X ) are not fixed.

3. Main Result

3.1. A Test to Achieve Maximum Error Exponent

The rule for deciding whether a test sequence was generated from $P_1 \in \mathcal{P}_S(\mathcal{X})$ or $P_2 \in \mathcal{P}_U(\mathcal{X})$ is called a decision rule. One of the goals of the classification problem is to design an optimal decision rule that achieves a maximum error exponent based on training sequences. In this section, we present a decision rule that asymptotically achieves the maximum type-II error exponent for any pair of distributions when the type-I error exponent is lower bounded by a constant for all pairs of distributions (cf. Theorem 1).
To define a test that is asymptotically optimum, we define two generalizations of the Jensen–Shannon divergence. These generalizations are related to some variational definitions in [7,16]. For any pair of distributions $(Q_1, Q_2) \in \mathcal{P}(\mathcal{X})^2$ and any number $\gamma \in \mathbb{R}_+$, let the generalized Jensen–Shannon divergence be
$$ \mathrm{GJS}(Q_1, Q_2, \gamma) := \gamma D\Big(Q_1 \,\Big\|\, \frac{\gamma Q_1 + Q_2}{1+\gamma}\Big) + D\Big(Q_2 \,\Big\|\, \frac{\gamma Q_1 + Q_2}{1+\gamma}\Big), \qquad (14) $$
where $D(p \| q)$ denotes the Kullback–Leibler divergence for $p \in \mathcal{P}(\mathcal{X})$ and $q \in \mathcal{P}(\mathcal{X})$ defined as
$$ D(p \| q) := \sum_{i \in \mathcal{X}} p(i) \log \frac{p(i)}{q(i)}. \qquad (15) $$
The generalized Jensen–Shannon divergence $\mathrm{GJS}(Q_1, Q_2, \gamma)$ corresponds to a skewed $\alpha$-Jensen–Shannon divergence for $\alpha = \frac{\gamma}{1+\gamma}$. Additionally, for $(Q_1, Q_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$, we define the minimized generalized Jensen–Shannon divergence by
$$ \mathrm{MGJS}(Q_1, Q_2, \gamma) := \min_{i \in S} \mathrm{GJS}(Q_1^i, Q_2, \gamma). \qquad (16) $$
Given a threshold $\lambda \in \mathbb{R}_+$ (including $\lambda = 0$), the decision rule to achieve the maximum error exponent is given by
$$ \Lambda_2^n = \Big\{ (\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) \in \mathcal{X}^n \times \mathcal{X}^{sN} \times \mathcal{X}^{uN} \,\Big|\, \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) \ge \tilde{\lambda} \Big\}, \qquad (17) $$
where $\tilde{\lambda} = \lambda + \eta(n)$ with $\eta(n) := \frac{2\log s + |\mathcal{X}|\log(n+N+1)}{n}$, and $\Lambda_i^n$, $i \in \{1, 2\}$, is the set of $(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2)$ determined to be class $i$ by the test $\Lambda^n$. By definition, the discriminant function $\mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma)$, appearing on the right-hand side of (17), can also be expressed as
$$ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) = \min_{i \in S} \Big[ \gamma D(q_{\boldsymbol{t}_1^i} \| q_{\boldsymbol{y}_1^i}) + D(q_{\boldsymbol{x}} \| q_{\boldsymbol{y}_1^i}) \Big], \qquad (18) $$
where $\boldsymbol{y}_1^i := \boldsymbol{x}\, \boldsymbol{t}_1^i$ denotes the concatenation of $\boldsymbol{x}$ and $\boldsymbol{t}_1^i$. From (17) and (18), $\Lambda^n$ is a type-based test and implicitly depends on $\lambda$. In addition, this test uses the training sequences asymmetrically: only the sequences $\boldsymbol{t}_1$ are used, but not $\boldsymbol{t}_2$ (cf. refs. [3,4,5]).
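As a sketch of how the test (17) operates (our own code; none of the helper names below appear in the paper), the snippet computes the generalized Jensen–Shannon divergence (14), its minimum over the subclasses (16), and the resulting decision: class 2 is declared when $\mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) \ge \tilde{\lambda}$, and class 1 otherwise.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in nats; terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gjs(q1, q2, gamma):
    """Generalized Jensen-Shannon divergence GJS(q1, q2, gamma) of (14)."""
    mix = (gamma * np.asarray(q1, float) + np.asarray(q2, float)) / (1.0 + gamma)
    return gamma * kl(q1, mix) + kl(q2, mix)

def mgjs(q_t1_subclasses, q_x, gamma):
    """Minimized GJS over the subclasses of the first source, cf. (16)."""
    return min(gjs(q, q_x, gamma) for q in q_t1_subclasses)

def classify(q_t1_subclasses, q_x, gamma, lam_tilde):
    """Decision rule (17): declare class 2 iff MGJS(q_t1, q_x, gamma) >= lam_tilde."""
    return 2 if mgjs(q_t1_subclasses, q_x, gamma) >= lam_tilde else 1

# Toy usage with types over a binary alphabet (hypothetical numbers).
q_t1 = [[0.62, 0.38], [0.70, 0.30]]   # types of the training sequences t_1^1, t_1^2
q_x = [0.55, 0.45]                    # type of the test sequence
print(classify(q_t1, q_x, gamma=2.0, lam_tilde=0.05))
```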
Theorem 1.
For any given $\lambda \in \mathbb{R}_+$ and any sequence of tests $\{\phi^n\}$ such that $\beta_1(\phi^n | \tilde{P}) \le \exp(-n\lambda)$ for all $\tilde{P} = (\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, the sequence of tests $\{\Lambda^n\}$ given by (17) satisfies, for any pair of distributions $P = (P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$,
$$ \beta_1(\Lambda^n | \tilde{P}) \le \exp(-n\lambda) \quad (\forall\, \tilde{P} \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})), \qquad (19) $$
$$ \liminf_{n\to\infty} -\frac{1}{n} \log \beta_2(\phi^n | P) \le \liminf_{n\to\infty} -\frac{1}{n} \log \beta_2(\Lambda^n | P). \qquad (20) $$
Proof. 
Equation (19) is derived in Section 3.3.1. The proof of (20) follows from Corollary 1. Although there is a deviation between the exponents in Corollary 1 and for the test Λ n , the deviation vanishes asymptotically. □
Theorem 1 shows that the test $\Lambda^n$ can asymptotically achieve the maximum type-II error exponent among the tests $\phi^n$ for which the type-I error exponent is greater than or equal to $\lambda$. This test also has a reversed and more relaxed property: it achieves $\hat{\lambda}(\epsilon)$, the maximum type-I error exponent when the type-II error probability is upper bounded by a constant $\epsilon$ ($0 \le \epsilon < 1$) (see the achievability proof of Theorem 2 in Section 3.3.1).

3.2. First-Order Maximum Error Exponent

In this section, we characterize the first-order maximum error exponent in a single-letter form for sources with multiple subclasses.
Theorem 2.
For any pair of distributions $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, we have
$$ \hat{\lambda}(\epsilon) = \sup\Big\{ \bar{\lambda} \,\Big|\, \sum_{j \in U:\, \mathrm{MGJS}(P_1, P_2^j, \gamma) < \bar{\lambda}} w(j) \le \epsilon \Big\}. \qquad (21) $$
It should be noted that $\hat{\lambda}(\epsilon)$ depends on $\{w(j)\}_{j \in U}$, but not on $\{v(i)\}_{i \in S}$.
Proof. 
The proof is provided in Section 3.3. □
Remark 2.
If $S$ and $U$ are singletons (that is, $s = 1$ and $u = 1$), Theorem 2 reduces to the following formula given by Zhou et al. [5]:
$$ \hat{\lambda}(\epsilon) = \mathrm{GJS}(P_1, P_2, \gamma) \quad (0 \le \epsilon < 1), \qquad (22) $$
which means that $\hat{\lambda}(\epsilon)$ does not depend on $\epsilon$ and the strong converse holds in this case, unlike in the case $|S|, |U| > 1$. On the other hand, for general $S$ and $U$ but in the special case of $\epsilon = 0$, formula (21) reduces to
$$ \hat{\lambda}(0) = \min_{i \in S,\, j \in U} \mathrm{GJS}(P_1^i, P_2^j, \gamma). \qquad (23) $$

3.3. Proof of Theorem 2

We divide the proof of Theorem 2 into two parts: the achievability (direct) part and the converse part.

3.3.1. Achievability Part

In the achievability proof, we use the type-based test $\Lambda^n$ given by (17). Fix any
$$ \lambda < \sup\Big\{ \bar{\lambda} \,\Big|\, \sum_{j \in U:\, \mathrm{MGJS}(P_1, P_2^j, \gamma) < \bar{\lambda}} w(j) \le \epsilon \Big\}. \qquad (24) $$
Then, for any pair of distributions $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, we show
$$ \beta_1(\Lambda^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda), \qquad (25) $$
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) \le \epsilon. \qquad (26) $$
First, we prove (25). As preliminaries, we define the following sets used in the proof:
$$ T_n(q_{\boldsymbol{x}}) := \{ \bar{\boldsymbol{x}} \in \mathcal{X}^n : q_{\bar{\boldsymbol{x}}} = q_{\boldsymbol{x}} \}, \qquad (27) $$
$$ \tilde{\Lambda}_2^n := \{ (\boldsymbol{x}, \boldsymbol{t}_1) : \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) \ge \tilde{\lambda} \}, \qquad (28) $$
$$ \Gamma(\Lambda) := \{ (q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) : (\boldsymbol{x}, \boldsymbol{t}_1) \in \Lambda \}, \qquad (29) $$
where $\tilde{\Lambda}_2^n$ is the projection of $\Lambda_2^n \subseteq \mathcal{X}^n \times \mathcal{X}^{sN} \times \mathcal{X}^{uN}$ onto the space $\mathcal{X}^n \times \mathcal{X}^{sN}$. To evaluate the probability of a source sequence being in $T_n(q_{\boldsymbol{x}})$, the following relationship holds from the method of types [15].
Lemma 1.
Suppose that the sequence $\boldsymbol{x}$ is sampled independently from the source $P \in \mathcal{P}(\mathcal{X})$. Then,
$$ \frac{1}{|\mathcal{P}_n(\mathcal{X})|} \exp\big(-n D(q_{\boldsymbol{x}} \| P)\big) \le P(T_n(q_{\boldsymbol{x}})) \le \exp\big(-n D(q_{\boldsymbol{x}} \| P)\big), \qquad (30) $$
where $\mathcal{P}_n(\mathcal{X})$ denotes the set of types formed from length-$n$ sequences with alphabet $\mathcal{X}$ and $|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}$.
Then, an upper bound on the type-I error probability of the test $\Lambda^n = (\Lambda_1^n, \Lambda_2^n)$ for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ can be evaluated as follows:
$$ \beta_1(\Lambda^n | \tilde{P}_1, \tilde{P}_2) = \sum_{(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) \in \Lambda_2^n} \tilde{P}_1(\boldsymbol{x}) \prod_{r=1}^{s} \tilde{P}_1^r(\boldsymbol{t}_1^r) \prod_{j=1}^{u} \tilde{P}_2^j(\boldsymbol{t}_2^j) = \sum_{(\boldsymbol{x}, \boldsymbol{t}_1) \in \tilde{\Lambda}_2^n} \sum_{i=1}^{s} v(i)\, \tilde{P}_1^i(\boldsymbol{x}) \prod_{r=1}^{s} \tilde{P}_1^r(\boldsymbol{t}_1^r) = \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} \sum_{i=1}^{s} v(i)\, \tilde{P}_1^i(T_n(q_{\boldsymbol{x}})) \prod_{r=1}^{s} \tilde{P}_1^r(T_N(q_{\boldsymbol{t}_1^r})) \qquad (31) $$
$$ \le \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} \sum_{i=1}^{s} \exp\Big(-n D(q_{\boldsymbol{x}} \| \tilde{P}_1^i) - \sum_{r=1}^{s} N D(q_{\boldsymbol{t}_1^r} \| \tilde{P}_1^r)\Big) \le \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} \sum_{i=1}^{s} \exp\Big(-n D(q_{\boldsymbol{x}} \| \tilde{P}_1^i) - N D(q_{\boldsymbol{t}_1^i} \| \tilde{P}_1^i)\Big) \qquad (32) $$
$$ = \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} \sum_{i=1}^{s} \exp\Big(-n D(q_{\boldsymbol{x}} \| q_{\boldsymbol{y}_1^i}) - N D(q_{\boldsymbol{t}_1^i} \| q_{\boldsymbol{y}_1^i}) - n(1+\gamma) D\Big(\frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma} \,\Big\|\, \tilde{P}_1^i\Big)\Big), \qquad (33) $$
where (31) is derived from (4), (32) follows from Lemma 1 and (33) is derived in Appendix A. Minimizing the exponents in (33) with respect to $i \in S$, we further obtain
$$ \beta_1(\Lambda^n | \tilde{P}_1, \tilde{P}_2) \le \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} s \exp\Big(-n \min_{i \in S} \big[ D(q_{\boldsymbol{x}} \| q_{\boldsymbol{y}_1^i}) + \gamma D(q_{\boldsymbol{t}_1^i} \| q_{\boldsymbol{y}_1^i}) \big]\Big) \cdot \exp\Big(-n \min_{i \in S} (1+\gamma) D\Big(\frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma} \,\Big\|\, \tilde{P}_1^i\Big)\Big) $$
$$ \le \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} s \exp(-n\tilde{\lambda}) \exp\Big(-n \min_{i \in S} (1+\gamma) D\Big(\frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma} \,\Big\|\, \tilde{P}_1^i\Big)\Big) = \exp\Big(-n\Big(\tilde{\lambda} - \frac{\log s}{n}\Big)\Big) \sum_{(q_{\boldsymbol{x}}, q_{\boldsymbol{t}_1}) \in \Gamma(\Lambda_2^n)} \exp\Big(-n \min_{i \in S} (1+\gamma) D\Big(\frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma} \,\Big\|\, \tilde{P}_1^i\Big)\Big) \le \exp\Big(-n\Big(\tilde{\lambda} - \frac{\log s}{n}\Big)\Big) \sum_{Q \in \mathcal{P}_{n+N}(\mathcal{X})} \exp\Big(-\min_{i \in S} (n+N) D(Q \| \tilde{P}_1^i)\Big) $$
$$ \le \exp\Big(-n\Big(\tilde{\lambda} - \frac{\log s}{n}\Big)\Big) (n+N+1)^{|\mathcal{X}|} \sum_{Q \in \mathcal{P}_{n+N}(\mathcal{X})} \max_{i \in S} \tilde{P}_1^i(T_{n+N}(Q)) \le \exp\Big(-n\Big(\tilde{\lambda} - \frac{\log s}{n}\Big)\Big) (n+N+1)^{|\mathcal{X}|} \sum_{i=1}^{s} \sum_{Q \in \mathcal{P}_{n+N}(\mathcal{X})} \tilde{P}_1^i(T_{n+N}(Q)) = \exp\Big(-n\Big(\tilde{\lambda} - \frac{\log s}{n}\Big)\Big) (n+N+1)^{|\mathcal{X}|}\, s = \exp\Big(-n\Big(\tilde{\lambda} - \frac{2\log s}{n} - \frac{|\mathcal{X}|\log(n+N+1)}{n}\Big)\Big) = \exp(-n\lambda), $$
where (34) is derived from (16) and (28), and (35) follows from Lemma 1.
Next, we demonstrate (26). As preliminaries, we define a new typical set and show some properties used in the proof. For any given $Q \in \mathcal{P}(\mathcal{X})$, define the following typical set:
$$ \mathcal{B}_n(Q) := \Big\{ \boldsymbol{x} \in \mathcal{X}^n : \max_{x \in \mathcal{X}} |q_{\boldsymbol{x}}(x) - Q(x)| \le \sqrt{\frac{\log n}{n}} \Big\}. \qquad (37) $$
By using [6] (Lemma 22), for $\boldsymbol{t}_1 \sim Q_1$ and $\boldsymbol{x} \sim Q_2$ generated from memoryless sources $(Q_1, Q_2) \in \mathcal{P}(\mathcal{X})^2$, we have
$$ \Pr\big\{ \boldsymbol{t}_1 \notin \mathcal{B}_N(Q_1) \ \text{or} \ \boldsymbol{x} \notin \mathcal{B}_n(Q_2) \big\} \le \frac{(1+\gamma^2)\, |\mathcal{X}|}{\gamma^2 n^2}. \qquad (38) $$
For any given $x \in \mathcal{X}$ and any pair of distributions $(Q_1, Q_2) \in \mathcal{P}(\mathcal{X})^2$, we define the information density as
$$ \iota_i(x | Q_1, Q_2, \gamma) := \log \frac{(1+\gamma)\, Q_i(x)}{\gamma Q_1(x) + Q_2(x)} \quad (i = 1, 2). \qquad (39) $$
Furthermore, for any pair of distributions $(Q_1, Q_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$, define the function $i^*(Q_1, Q_2)$, the index of the minimizing subclass of $Q_1$, as follows:
$$ i^*(Q_1, Q_2) := \arg\min_{i \in S} \Big[ \gamma D\Big(Q_1^i \,\Big\|\, \frac{\gamma Q_1^i + Q_2}{1+\gamma}\Big) + D\Big(Q_2 \,\Big\|\, \frac{\gamma Q_1^i + Q_2}{1+\gamma}\Big) \Big] = \arg\min_{i \in S} \mathrm{GJS}(Q_1^i, Q_2, \gamma). \qquad (40) $$
Hereafter, we denote $i^*(Q_1, Q_2)$ simply as $i^*(Q_2)$ when the first argument is clear from the context.
Lemma 2.
Assume that $(Q_1, Q_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$. For $\boldsymbol{t}_1^i \in \mathcal{B}_N(Q_1^i)$ $(i \in S)$ and $\boldsymbol{x} \in \mathcal{B}_n(Q_2)$, we have
$$ i^*(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}) \to i^*(Q_1, Q_2) \quad (n \to \infty). \qquad (41) $$
Proof. 
The proof is provided in Appendix B. □
Lemma 3
(Zhou et al. [5]). Assume that $(Q_1, Q_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$. For $\boldsymbol{t}_1^i \in \mathcal{B}_N(Q_1^i)$ $(i \in S)$ and $\boldsymbol{x} \in \mathcal{B}_n(Q_2)$, by applying the Taylor expansion to $\mathrm{GJS}(q_{\boldsymbol{t}_1^i}, q_{\boldsymbol{x}}, \gamma)$ around $(Q_1^i, Q_2)$, we have
$$ \mathrm{GJS}(q_{\boldsymbol{t}_1^i}, q_{\boldsymbol{x}}, \gamma) = \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^i | Q_1^i, Q_2, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | Q_1^i, Q_2, \gamma) + O\Big(\frac{\log n}{n}\Big), \qquad (42) $$
where $t_{1,m}^i$ denotes the $m$-th symbol of the sequence $\boldsymbol{t}_1^i$ and $x_m$ denotes the $m$-th symbol of the sequence $\boldsymbol{x}$.
Note that the probability $\mathrm{P}_2$ is calculated by assuming that the test sequence $\boldsymbol{x}$ is generated from $P_2$ (cf. Equation (11)). An upper bound on the type-II error probability can be evaluated as follows:
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) = \limsup_{n\to\infty} \mathrm{P}_2\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda + \eta(n) \big\} = \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda + \eta(n) \big\} $$
$$ \le \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda + \eta(n),\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \big\} + \limsup_{n\to\infty} \sum_{i=1}^{s} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\big\{ \boldsymbol{x} \notin \mathcal{B}_n(P_2^j) \ \text{or} \ \boldsymbol{t}_1^i \notin \mathcal{B}_N(P_1^i) \big\} $$
$$ \le \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\big\{ \mathrm{GJS}(q_{\boldsymbol{t}_1^{i^*(q_{\boldsymbol{x}})}}, q_{\boldsymbol{x}}, \gamma) < \lambda + \eta(n),\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \big\}, \qquad (43) $$
where Equation (43) follows from Equations (14), (38) and (40). By using Lemma 3, Equation (43) can be expanded as follows:
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) \le \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) < \lambda + O\Big(\frac{\log n}{n}\Big),\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} $$
$$ \le \sum_{j=1}^{u} \limsup_{n\to\infty} w(j)\, \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) < \lambda + O\Big(\frac{\log n}{n}\Big),\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} \qquad (44) $$
$$ = \sum_{j=1}^{u} \limsup_{n\to\infty} w(j)\, \mathrm{P}_2^j\big\{ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \big\} \cdot \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) < \lambda + O\Big(\frac{\log n}{n}\Big) \,\Big|\, \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\}, $$
where Equation (44) is derived from Fatou's lemma. It follows from (38) that
$$ \mathrm{P}_2^j\big\{ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ i \in S \big\} \to 1 \quad (n \to \infty). $$
Here, Equation (14) can also be expressed as follows:
$$ \mathrm{GJS}(Q_1, Q_2, \gamma) = \gamma\, \mathrm{E}_{Q_1}[\iota_1(X | Q_1, Q_2, \gamma)] + \mathrm{E}_{Q_2}[\iota_2(X | Q_1, Q_2, \gamma)]. $$
Therefore, by the weak law of large numbers, for any given $\delta > 0$,
$$ \limsup_{n\to\infty} \Pr\Big\{ \Big| \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) - \mathrm{GJS}(P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) \Big| \le \delta \Big\} = 1. $$
From this result of the weak law of large numbers and Lemma 2, combining (45)–(48) gives
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) \le \sum_{j \in U:\, \mathrm{GJS}(P_1^{i^*(P_2^j)}, P_2^j, \gamma) < \lambda} w(j). $$
Thus, by (24), we can see that
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) \le \epsilon. $$

3.3.2. Converse Part

For any pair of distributions $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, fix any test $\phi^n$ satisfying
$$ \beta_1(\phi^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda), \qquad (51) $$
$$ \limsup_{n\to\infty} \beta_2(\phi^n | P_1, P_2) \le \epsilon. \qquad (52) $$
We show
$$ \lambda \le \sup\Big\{ \bar{\lambda} \,\Big|\, \sum_{j \in U:\, \mathrm{MGJS}(P_1, P_2^j, \gamma) < \bar{\lambda}} w(j) \le \epsilon \Big\}. \qquad (53) $$
We first give some lemmas, which are useful in the proof of the converse part.
Lemma 4.
Let $\phi^n$ be a test in which the decision rule depends only on $(\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) \in \mathcal{X}^n \times \mathcal{X}^{sN} \times \mathcal{X}^{uN}$. Then, for any given $\kappa \in [0, 1]$, we can construct a type-based test $\Omega^n$ satisfying
$$ \beta_1(\phi^n | \tilde{P}_1, \tilde{P}_2) \ge \kappa\, \beta_1(\Omega^n | \tilde{P}_1, \tilde{P}_2), \qquad (54) $$
$$ \beta_2(\phi^n | \tilde{P}_1, \tilde{P}_2) \ge (1-\kappa)\, \beta_2(\Omega^n | \tilde{P}_1, \tilde{P}_2) \qquad (55) $$
for any pair of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$.
Proof. 
Lemma 4 can be proved in the same way as (Lemma 7 [5]), the proof of which is inspired by (Lemma 2 [3]). □
Remark 3.
As in the proof of (Lemma 2 [3]) and (Lemma 7 [5]), a type-based test Ω n specified in Lemma 4 is obtained by tailoring ϕ n and satisfies Equations (54) and (55) for all ( P ˜ 1 , P ˜ 2 ) P S ( X ) × P U ( X ) . In other words, the construction of Ω n is universal, which is in the same spirit of (Lemma 2 [3]). This claim is slightly stronger than the one in (Lemma 7 [5]).
Lemma 5.
For any $\lambda \in \mathbb{R}_+$ and any type-based test $\Omega^n$ satisfying the condition that, for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$,
$$ \beta_1(\Omega^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda), \qquad (56) $$
we have, for any pair of distributions $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$,
$$ \beta_2(\Omega^n | P_1, P_2) \ge \mathrm{P}_2\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda - \rho(n) \big\}, \qquad (57) $$
where $\rho(n) := \frac{|\mathcal{X}|\log(n+1) + (s+u)|\mathcal{X}|\log(N+1) - \log v}{n}$ with $v := \min\{ v(i) : v(i) > 0,\ i \in S \}$.
Proof. 
The proof is provided in Appendix C. □
The type-based test Ω n specified in Lemma 4 satisfies Equations (54) and (55) for all ( P ˜ 1 , P ˜ 2 ) P S ( X ) × P U ( X ) . If we set κ = 1 / n in Lemma 4, and combine it with Lemma 5, we can derive the following relation:
Corollary 1.
For any given $\lambda \in \mathbb{R}_+$ and any test $\phi^n$ satisfying the condition that, for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$,
$$ \beta_1(\phi^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda), \qquad (58) $$
we have, for any pair of distributions $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$,
$$ \beta_2(\phi^n | P_1, P_2) \ge \Big(1 - \frac{1}{n}\Big)\, \mathrm{P}_2\Big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) + \rho(n) + \frac{\log n}{n} < \lambda \Big\}. \qquad (59) $$
Proof. 
The proof is provided in Appendix D. □
By (59), a lower bound on the type-II error probability can be evaluated as follows:
$$ \limsup_{n\to\infty} \beta_2(\phi^n | P_1, P_2) \ge \limsup_{n\to\infty} \Big(1 - \frac{1}{n}\Big)\, \mathrm{P}_2\Big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) + O\Big(\frac{\log n}{n}\Big) < \lambda \Big\} = \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\Big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) + O\Big(\frac{\log n}{n}\Big) < \lambda \Big\} $$
$$ \ge \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\Big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) + O\Big(\frac{\log n}{n}\Big) < \lambda,\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} $$
$$ = \limsup_{n\to\infty} \sum_{j=1}^{u} w(j)\, \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + O\Big(\frac{\log n}{n}\Big) < \lambda,\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} \qquad (60) $$
$$ \ge \sum_{j=1}^{u} \liminf_{n\to\infty} w(j)\, \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + O\Big(\frac{\log n}{n}\Big) < \lambda,\ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} \qquad (61) $$
$$ \ge \sum_{j=1}^{u} \liminf_{n\to\infty} w(j)\, \mathrm{P}_2^j\big\{ \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \big\} \cdot \mathrm{P}_2^j\Big\{ \frac{\gamma}{N} \sum_{m=1}^{N} \iota_1(t_{1,m}^{i^*(q_{\boldsymbol{x}})} | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) + \frac{1}{n} \sum_{m=1}^{n} \iota_2(x_m | P_1^{i^*(q_{\boldsymbol{x}})}, P_2^j, \gamma) < \lambda + O\Big(\frac{\log n}{n}\Big) \,\Big|\, \boldsymbol{x} \in \mathcal{B}_n(P_2^j),\ \boldsymbol{t}_1^i \in \mathcal{B}_N(P_1^i),\ i \in S \Big\} \ge \sum_{j \in U:\, \mathrm{GJS}(P_1^{i^*(P_2^j)}, P_2^j, \gamma) < \lambda} w(j), \qquad (62) $$
where Equations (60) and (61) are derived from Lemma 3 and Fatou's lemma, respectively. Equation (62) follows from Equations (45)–(48). By Equation (52), Equation (62) indicates that
$$ \lambda \le \sup\Big\{ \bar{\lambda} \,\Big|\, \sum_{j \in U:\, \mathrm{GJS}(P_1^{i^*(P_2^j)}, P_2^j, \gamma) < \bar{\lambda}} w(j) \le \epsilon \Big\}. \qquad (63) $$

3.4. Second-Order Maximum Error Exponent

In this section, we characterize the second-order maximum error exponent. For simplicity, we assume that only $P_2$ has subclasses, but $P_1$ does not ($s = 1$). First, from Theorem 2 with $s = 1$, the first-order maximum error exponent in this setting is characterized as follows: for any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, we have
$$ \hat{\lambda}(\epsilon) = \sup\Big\{ \bar{\lambda} \,\Big|\, \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) < \bar{\lambda}} w(j) \le \epsilon \Big\}, \qquad (64) $$
where $\mathrm{MGJS}(P_1, P_2^j, \gamma)$ in (21) is replaced by $\mathrm{GJS}(P_1, P_2^j, \gamma)$.
Next, we provide a characterization of the second-order maximum error exponent in Definition 2 with $s = 1$ in the case where only $P_2$ has subclasses. By definition, $\hat{r}(\epsilon, \lambda) = +\infty$ if $\lambda < \hat{\lambda}(\epsilon)$ and $\hat{r}(\epsilon, \lambda) = -\infty$ if $\lambda > \hat{\lambda}(\epsilon)$. Therefore, in the discussion of the second-order error exponent, we focus on the case $\lambda = \hat{\lambda}(\epsilon)$.
Theorem 3.
For any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and $\epsilon \in [0, 1)$,
$$ \hat{r}(\epsilon, \lambda) = \sup\Big\{ r \,\Big|\, \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) < \lambda} w(j) + \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) = \lambda} \Phi_j(r)\, w(j) \le \epsilon \Big\}, \qquad (65) $$
where $\lambda = \hat{\lambda}(\epsilon)$,
$$ \Phi_j(r) := G\Big(\frac{r}{\sqrt{V(P_1, P_2^j, \gamma)}}\Big), \qquad (66) $$
$$ G(a) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-\frac{x^2}{2}}\, dx, \qquad (67) $$
which is the cumulative distribution function of the standard Gaussian distribution, and for any pair of distributions $(Q_1, Q_2) \in \mathcal{P}(\mathcal{X})^2$,
$$ V(Q_1, Q_2, \gamma) := \gamma\, \mathrm{Var}_{Q_1}[\iota_1(X | Q_1, Q_2, \gamma)] + \mathrm{Var}_{Q_2}[\iota_2(X | Q_1, Q_2, \gamma)], \qquad (68) $$
where $\mathrm{Var}_Q[\cdot]$ represents the variance with respect to $Q \in \mathcal{P}(\mathcal{X})$.
Proof. 
The proof is provided in Appendix E. □
Remark 4.
If $U$ is a singleton ($u = 1$), Theorem 3 reduces to $\hat{r}(\epsilon, \lambda) = \sqrt{V(P_1, P_2, \gamma)}\, G^{-1}(\epsilon)$ for $\lambda = \hat{\lambda}(\epsilon) = \mathrm{GJS}(P_1, P_2, \gamma)$, which is the same result given by Zhou et al. [5].
Remark 5.
We can summarize the two terms on the right-hand side of (65) into the following single term, called the canonical equation [6]:
$$ \sum_{j \in U} w(j) \lim_{n\to\infty} \Phi_j\big( \sqrt{n}\,(\lambda - \mathrm{GJS}(P_1, P_2^j, \gamma)) + r \big). \qquad (69) $$
We focus on the case $\lambda = \hat{\lambda}(\epsilon)$. From Theorem 2, it holds that
$$ \sum_{j \in U:\, \mathrm{MGJS}(P_1, P_2^j, \gamma) < \lambda} w(j) \le \epsilon \qquad (70) $$
and
$$ \sum_{j \in U:\, \mathrm{MGJS}(P_1, P_2^j, \gamma) \le \lambda} w(j) \ge \epsilon. \qquad (71) $$
Here, let us consider the following canonical equation for $r$:
$$ \sum_{j \in U} w(j) \lim_{n\to\infty} \Phi_j\big( \sqrt{n}\,(\hat{\lambda}(\epsilon) - \mathrm{GJS}(P_1, P_2^j, \gamma)) + r \big) = \epsilon. \qquad (72) $$
Thus, in view of (70) and (71), this equation always has a solution $r = r^*(\epsilon)$. If
$$ \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) = \hat{\lambda}(\epsilon)} w(j) = 0, \qquad (73) $$
the solution is not unique ($r^*(\epsilon) = +\infty$). By using the solution $r^*(\epsilon)$, Equation (65) with $\lambda = \hat{\lambda}(\epsilon)$ can be expressed in the simpler form
$$ \hat{r}(\epsilon, \lambda) = r^*(\epsilon). \qquad (74) $$

4. Generalization to Mixed Memoryless Sources with General Mixture

In this section, we consider the classification problem in the case where $P_1$ does not have subclasses and $P_2$ is given by a general mixture model. The general mixture model considered in this problem is an extension of the source with multiple subclasses defined in Section 2.2. Since the decision rule that achieves the maximum error exponent can operate using only one of the training sequences, we assume in this section that only the training sequence $\boldsymbol{t}_1$ is available. Then, we provide a characterization of the maximum error exponents in a single-letter form under this setting. First, we define the source referred to as a mixed memoryless source with general mixture [6,9] as follows. Let $\Theta$ be an arbitrary probability space with a general probability measure $w(\theta)$, $\theta \in \Theta$. Then, the probability of $\boldsymbol{x} \in \mathcal{X}^n$ is given by
$$ P_2(\boldsymbol{x}) = \int_{\Theta} P_2^{\theta}(\boldsymbol{x})\, dw(\theta), \qquad (75) $$
where $P_2^{\theta}$ is a stationary and memoryless source for each $\theta \in \Theta$. That is, for $\boldsymbol{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$,
$$ P_2^{\theta}(\boldsymbol{x}) = \prod_{i=1}^{n} P_2^{\theta}(x_i). \qquad (76) $$
When a test sequence is output from $P_2$, the probability distribution of the sequence takes the form of (75). Here, the type-I and type-II error probabilities of a test $\phi^n = (\phi_1^n, \phi_2^n)$ are given by
$$ \beta_1(\phi^n | P_1, P_2) = \sum_{(\boldsymbol{x}, \boldsymbol{t}_1) \in \phi_2^n} P_1(\boldsymbol{x})\, P_1(\boldsymbol{t}_1), \qquad (77) $$
$$ \beta_2(\phi^n | P_1, P_2) = \sum_{(\boldsymbol{x}, \boldsymbol{t}_1) \in \phi_1^n} P_2(\boldsymbol{x})\, P_1(\boldsymbol{t}_1). \qquad (78) $$
Theorem 4.
(First- and second-order maximum error exponents). For any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_\Theta(\mathcal{X})$ and $\epsilon \in [0, 1)$, we have
$$ \hat{\lambda}(\epsilon) = \sup\Big\{ \bar{\lambda} \,\Big|\, \int_{\{\theta \in \Theta:\, \mathrm{GJS}(P_1, P_2^\theta, \gamma) < \bar{\lambda}\}} dw(\theta) \le \epsilon \Big\} \qquad (79) $$
and
$$ \hat{r}(\epsilon, \lambda) = \sup\Big\{ r \,\Big|\, \int_{\{\theta \in \Theta:\, \mathrm{GJS}(P_1, P_2^\theta, \gamma) < \lambda\}} dw(\theta) + \int_{\{\theta \in \Theta:\, \mathrm{GJS}(P_1, P_2^\theta, \gamma) = \lambda\}} \Phi_\theta(r)\, dw(\theta) \le \epsilon \Big\}. \qquad (80) $$
Proof. 
We can prove this theorem in the same way as Theorems 2 and 3. □
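As a purely illustrative sketch of how formula (79) can be evaluated (the distribution family, parameter values, and all function names below are our own assumptions, not taken from the paper), consider $\Theta$ indexing Bernoulli sources with a continuous prior: $\hat{\lambda}(\epsilon)$ is the largest threshold $\bar{\lambda}$ for which the prior mass of $\{\theta : \mathrm{GJS}(P_1, P_2^\theta, \gamma) < \bar{\lambda}\}$ does not exceed $\epsilon$, which can be approximated by numerical integration over a grid of $\theta$.

```python
import numpy as np

def gjs(q1, q2, gamma):
    """GJS divergence (14) for distributions on a finite alphabet."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    mix = (gamma * q1 + q2) / (1.0 + gamma)
    kl = lambda p, q: float(np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0])))
    return gamma * kl(q1, mix) + kl(q2, mix)

def lambda_hat_general_mixture(p1, theta_grid, theta_pdf, gamma, eps):
    """Numerically approximate (79) when P2 is a mixture of Bernoulli(theta)
    sources with prior density theta_pdf on theta_grid (illustrative setup)."""
    d = np.array([gjs([1 - p1, p1], [1 - t, t], gamma) for t in theta_grid])
    w = theta_pdf(theta_grid)
    w = w / np.trapz(w, theta_grid)                       # normalize the prior
    lam_grid = np.linspace(0.0, d.max() + 1e-3, 2000)
    mass = np.array([np.trapz(w * (d < lam), theta_grid) for lam in lam_grid])
    return lam_grid[mass <= eps].max()

# Hypothetical example: theta uniform on [0.20, 0.35], P1 = Bern(0.268), gamma = 2.
grid = np.linspace(0.20, 0.35, 500)
print(lambda_hat_general_mixture(0.268, grid, lambda t: np.ones_like(t), 2.0, eps=0.3))
```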

5. Numerical Calculation

5.1. First-Order Maximum Error Exponent

In this section, we present a numerical example to illustrate the first-order maximum type-I error exponent λ ^ ( ϵ ) characterized in Theorem 2.
A numerical example of the first-order maximum error exponent is given by calculating the right-hand side of (64) for the following settings. We assume that $\mathcal{X} = \{0, 1\}$. We fix the set of probabilities and weights
$$ P_1 = \big\{ \mathrm{Bern}(0.389),\ v(1) = \tfrac{1}{3};\ \ \mathrm{Bern}(0.322),\ v(2) = \tfrac{1}{3};\ \ \mathrm{Bern}(0.256),\ v(3) = \tfrac{1}{3} \big\} $$
and
$$ P_2 = \big\{ \mathrm{Bern}(0.301),\ w(1) = \tfrac{1}{3};\ \ \mathrm{Bern}(0.244),\ w(2) = \tfrac{1}{6};\ \ \mathrm{Bern}(0.223),\ w(3) = \tfrac{1}{2} \big\}, $$
where $\mathrm{Bern}(\cdot)$ denotes the Bernoulli distribution. The relation among $\hat{\lambda}(\epsilon)$, $\epsilon$ and $\gamma$ is shown in Figure 2. Additionally, for $\gamma = 2$, the behavior of $\hat{\lambda}(\epsilon)$ is depicted in Figure 3. As $\epsilon$ becomes larger, the value of $\hat{\lambda}(\epsilon)$ increases like a step function; the steps occur at $\epsilon = \frac{1}{6}$ and $\epsilon = \frac{1}{2}$. We can also confirm that $\hat{\lambda}(\epsilon)$ is right-continuous in $\epsilon$. This is because the limit superior of the type-II error probability is constrained in Definition 1.
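A sketch of the computation behind Figures 2 and 3 (our own code; it assumes the `gjs` and `mgjs` helpers from the sketch in Section 3.1 are in scope, and `lambda_hat` is our own name) evaluates the characterization in Theorem 2: the supremum in (21) is attained at the smallest MGJS distance whose cumulative subclass weight exceeds $\epsilon$, which is what produces the steps at $\epsilon = \frac{1}{6}$ and $\epsilon = \frac{1}{2}$ for these weights.

```python
import numpy as np
# Assumes gjs(...) and mgjs(...) from the sketch in Section 3.1 are in scope.

def lambda_hat(eps, dists, weights):
    """Evaluate the sup in (21): the largest threshold such that the total weight
    of subclasses j with dists[j] strictly below it does not exceed eps."""
    order = np.argsort(dists)
    cum = 0.0
    for j in order:
        cum += weights[j]
        if cum > eps:
            return dists[j]
    return float('inf')   # unreachable for eps < 1, since the weights sum to one

gamma = 2.0
P1_sub = [[0.611, 0.389], [0.678, 0.322], [0.744, 0.256]]   # subclasses of P1
P2_sub = [[0.699, 0.301], [0.756, 0.244], [0.777, 0.223]]   # subclasses of P2
w = [1/3, 1/6, 1/2]
dists = [mgjs(P1_sub, q2, gamma) for q2 in P2_sub]          # MGJS(P1, P2^j, gamma)
for eps in (0.0, 0.1, 1/6, 0.4, 0.5, 0.9):
    print(f"eps = {eps:.3f}:  lambda_hat = {lambda_hat(eps, dists, w):.5f}")
```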

5.2. Second-Order Maximum Error Exponent

As in the previous subsection, we present a numerical example to illustrate the second-order maximum type-I error exponent r ^ ( ϵ , λ ) characterized in Theorem 3.
A numerical example of the second-order maximum error exponent is given by calculating the right-hand side of (65) for the following settings. We assume that $\gamma = 2$ and $\mathcal{X} = \{0, 1\}$. We fix $\lambda = \hat{\lambda}(\epsilon)$ and the set of probabilities and weights
$$ P_1 = \mathrm{Bern}(0.268) $$
and
$$ P_2 = \big\{ \mathrm{Bern}(0.301),\ w(1) = \tfrac{1}{3};\ \ \mathrm{Bern}(0.244),\ w(2) = \tfrac{1}{6};\ \ \mathrm{Bern}(0.223),\ w(3) = \tfrac{1}{2} \big\}, $$
where $P_2$ is the same as in the previous subsection. The behavior of $\hat{r}(\epsilon, \lambda)$ is shown in Figure 4. On each of the intervals $0 \le \epsilon < \frac{1}{6}$, $\frac{1}{6} \le \epsilon < \frac{1}{2}$ and $\frac{1}{2} \le \epsilon < 1$, the value of $\hat{r}(\epsilon, \lambda)$ follows a scaled inverse of the cumulative distribution function of the standard Gaussian. In contrast to the first-order exponent $\hat{\lambda}(\epsilon)$, $\hat{r}(\epsilon, \lambda)$ is no longer right-continuous in $\epsilon$.
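The computation behind Figure 4 can be sketched in the same spirit (again our own code: it assumes the `gjs` helper from Section 3.1 and `lambda_hat` from the previous sketch are in scope, uses SciPy's Gaussian CDF for $G$, and the names `dispersion` and `r_hat` are ours). For each $\epsilon$, the threshold $\lambda = \hat{\lambda}(\epsilon)$ is computed first, and then the supremum in (65) is found by a grid search over $r$.

```python
import numpy as np
from scipy.stats import norm
# Assumes gjs(...) from Section 3.1 and lambda_hat(...) from Section 5.1 sketches.

def dispersion(q1, q2, gamma):
    """V(Q1, Q2, gamma) of (68): gamma*Var_{Q1}[iota_1] + Var_{Q2}[iota_2]."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    mix = (gamma * q1 + q2) / (1.0 + gamma)
    i1, i2 = np.log(q1 / mix), np.log(q2 / mix)     # information densities (39)
    var1 = np.sum(q1 * i1 ** 2) - np.sum(q1 * i1) ** 2
    var2 = np.sum(q2 * i2 ** 2) - np.sum(q2 * i2) ** 2
    return gamma * var1 + var2

def r_hat(eps, lam, dists, weights, variances, tol=1e-12):
    """Evaluate the sup in (65) at lambda = lambda_hat(eps) by a grid search in r."""
    below = sum(w for d, w in zip(dists, weights) if d < lam - tol)
    at = [(w, v) for d, w, v in zip(dists, weights, variances) if abs(d - lam) < tol]
    if not at:
        return float('inf')        # no subclass sits exactly at lambda (cf. Remark 5)
    rs = np.linspace(-10.0, 10.0, 200001)
    vals = below + sum(w * norm.cdf(rs / np.sqrt(v)) for w, v in at)
    ok = rs[vals <= eps]
    return float(ok[-1]) if ok.size else -float('inf')

gamma = 2.0
p1 = [0.732, 0.268]                                 # P1 = Bern(0.268)
P2_sub = [[0.699, 0.301], [0.756, 0.244], [0.777, 0.223]]
w = [1/3, 1/6, 1/2]
dists = [gjs(p1, q2, gamma) for q2 in P2_sub]
variances = [dispersion(p1, q2, gamma) for q2 in P2_sub]
for eps in (0.05, 0.3, 0.7):
    lam = lambda_hat(eps, dists, w)
    print(f"eps = {eps:.2f}:  r_hat = {r_hat(eps, lam, dists, w, variances):+.4f}")
```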

6. Conclusions

For binary classification of sources with multiple subclasses, we characterized the first- and second-order maximum error exponents. First, we characterized the first-order maximum error exponent in the case where $P_1$ and $P_2$ are both sources with multiple subclasses. In order to derive this representation, we gave a classifier that asymptotically achieves the maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses.
Next, we showed the second-order maximum error exponent in the case where only one of the sources has subclasses. The key technique for deriving the second-order maximum error exponent is to apply the Berry–Esseen theorem [8] instead of the weak law of large numbers. One may wonder whether we can also derive the second-order approximation in the case where $P_1$ is also a source with multiple subclasses. To this end, we need to evaluate Lemma 2 more rigorously; this is left for future work.
In addition, for binary classification using only a training sequence generated from P 1 in the case where P 1 does not have subclasses and P 2 is given by a general mixture model, we generalized the analysis for the first- and second-order error exponents. From these results, we revealed the asymptotic performance limits of statistical classification for sources with multiple subclasses.
In this paper, we considered a binary classification problem, but in practice, multiclass classification is of importance. In the case where each class is a memoryless source (without multiple subclasses), the first- and second-order maximum error exponents were analyzed in [5]. Extending the obtained results to multiclass classification for sources with multiple subclasses is also a subject of future studies.

Author Contributions

Author H.K. contributed to the conceptualization of the research goals and aims, the visualization, the formal analysis of the results, and the review and editing. Author H.Y. contributed to the conceptualization of the ideas, the validation of the results, and the supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by JSPS KAKENHI Grant Numbers JP20K04462 and JP18H01438.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Equation (33).
We shall show
$$ N D(q_{\boldsymbol{t}_1} \| \tilde{P}_1) + n D(q_{\boldsymbol{x}} \| \tilde{P}_1) = N D(q_{\boldsymbol{t}_1} \| q_{\boldsymbol{y}_1}) + n D(q_{\boldsymbol{x}} \| q_{\boldsymbol{y}_1}) + n(1+\gamma)\, D\Big( \frac{\gamma q_{\boldsymbol{t}_1} + q_{\boldsymbol{x}}}{1+\gamma} \,\Big\|\, \tilde{P}_1 \Big). \qquad (A1) $$
The left-hand side of (A1) is expanded as follows:
N D ( q t 1 | | P ˜ 1 ) + n D ( q x | | P ˜ 1 ) = n γ x X q t 1 ( x ) log q t 1 ( x ) P ˜ 1 ( x ) + n x X q x ( x ) log q x ( x ) P ˜ 1 ( x ) n γ x X q t 1 ( x ) log γ q t 1 ( x ) + q x 1 ( x ) 1 + γ + n γ x X q t 1 ( x ) log γ q t 1 ( x ) + q x ( x ) 1 + γ n x X q x ( x ) log γ q t 1 ( x ) + q x ( x ) 1 + γ + n x X q x ( x ) log γ q t 1 ( x ) + q x 1 ( x ) 1 + γ = n γ E q t 1 log ( 1 + γ ) q t 1 ( x ) γ q t 1 ( x ) + q x ( x ) + n E q x log ( 1 + γ ) q x ( x ) γ q t 1 ( x ) + q x ( x ) n x X ( γ q t 1 ( x ) + q x ( x ) ) log P ˜ 1 ( x ) + n x X ( γ q t 1 ( x ) + q x ( x ) ) log γ q t 1 ( x ) + q x ( x ) 1 + γ = N D ( q t 1 | | q y 1 ) + n D ( q x | | q y 1 ) + n ( 1 + γ ) D γ q t 1 + q x 1 + γ | | P ˜ 1 .
Therefore, we obtain Equation (33). □

Appendix B

Proof of Lemma 2.
Applying the Taylor expansion to $\mathrm{GJS}(q_{\boldsymbol{t}_1^i}, q_{\boldsymbol{x}}, \gamma)$ around $(Q_1^i, Q_2)$ for any $\boldsymbol{x} \in \mathcal{B}_n(Q_2)$ and $\boldsymbol{t}_1^i \in \mathcal{B}_N(Q_1^i)$ $(i \in S)$, we obtain
$$ \mathrm{GJS}(q_{\boldsymbol{t}_1^i}, q_{\boldsymbol{x}}, \gamma) = \mathrm{GJS}(Q_1^i, Q_2, \gamma) + O\Big(\sqrt{\frac{\log n}{n}}\Big). \qquad (A3) $$
Therefore, by (14), for any $\boldsymbol{x} \in \mathcal{B}_n(Q_2)$ and $\boldsymbol{t}_1^i \in \mathcal{B}_N(Q_1^i)$ $(i \in S)$,
$$ \gamma D\Big(q_{\boldsymbol{t}_1^i} \,\Big\|\, \frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma}\Big) + D\Big(q_{\boldsymbol{x}} \,\Big\|\, \frac{q_{\boldsymbol{x}} + \gamma q_{\boldsymbol{t}_1^i}}{1+\gamma}\Big) \to \gamma D\Big(Q_1^i \,\Big\|\, \frac{Q_2 + \gamma Q_1^i}{1+\gamma}\Big) + D\Big(Q_2 \,\Big\|\, \frac{Q_2 + \gamma Q_1^i}{1+\gamma}\Big) \quad (n \to \infty) \qquad (A4) $$
holds, and the convergence is uniform in $i \in S$. Since for any $(Q_1, Q_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$, $i^*(Q_1, Q_2)$ was given in the form
$$ i^*(Q_1, Q_2) = \arg\min_{i \in S} \Big[ \gamma D\Big(Q_1^i \,\Big\|\, \frac{Q_2 + \gamma Q_1^i}{1+\gamma}\Big) + D\Big(Q_2 \,\Big\|\, \frac{Q_2 + \gamma Q_1^i}{1+\gamma}\Big) \Big], \qquad (A5) $$
we obtain
$$ i^*(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}) \to i^*(Q_1, Q_2) \quad (n \to \infty). \qquad (A6) $$

Appendix C

Proof of Lemma 5.
It follows from (56) that
exp n λ
( x , t 1 , t 2 ) Ω 2 n P ˜ 1 ( x ) r = 1 s P ˜ 1 r ( t 1 r ) j = 1 u P ˜ 2 j ( t 2 j ) = ( x , t 1 , t 2 ) Ω 2 n i = 1 s P ˜ 1 i ( x ) r = 1 s P ˜ 1 r ( t 1 r ) j = 1 u P ˜ 2 j ( t 2 j ) v ( i ) ( x , t 1 , t 2 ) Ω 2 n P ˜ 1 i ( x ) r = 1 s P ˜ 1 r ( t 1 r ) j = 1 u P ˜ 2 j ( t 2 j ) v ( i ) ( i S , v ( i ) > 0 ) = ( q x , q t 1 , q t 2 ) Γ ( Ω 2 n ) P ˜ 1 i ( T n ( q x ) ) r = 1 s P ˜ 1 r ( T N ( q t 1 r ) ) j = 1 u P ˜ 2 j ( T N ( q t 2 j ) ) v ( i ) P ˜ 1 i ( T n ( q x ) ) r = 1 s P ˜ 1 r ( T N ( q t 1 r ) ) j = 1 u P ˜ 2 j ( T N ( q t 2 j ) ) v ( i ) ( ( q x , q t 1 , q t 2 ) Γ ( Ω 2 n ) ) .
Using (30) on the right-hand side of (A7), we obtain
exp n λ exp n ρ ( n ) exp n D ( q x | | P ˜ 1 i ) + r = 1 s γ D ( q t 1 r | | P ˜ 1 r ) + j = 1 u γ D ( q t 2 j | | P ˜ 2 j ) ,
where ρ ( n ) = | X | log ( n + 1 ) + ( s + u ) | X | log ( N + 1 ) log v n . Taking the negative logarithm of both sides and divide by n in (A8), we have
λ ρ ( n ) D ( q x | | P ˜ 1 i ) + r = 1 s γ D ( q t 1 r | | P ˜ 1 r ) + j = 1 u γ D ( q t 2 j | | P ˜ 2 j ) .
Since ( P ˜ 1 i , P ˜ 1 r , P ˜ 2 j ) P ( X ) 3 is arbitrary in (A9), we can set the i-th subclass as P ˜ 1 i = q x t 1 i = q y 1 i and the other as P ˜ 1 r = q t 1 r , r i . Furthermore, we set P ˜ 2 j = q t 2 j , j U . Then we obtain
λ ρ ( n ) D ( q x | | q y 1 i ) + γ D ( q t 1 i | | q y 1 i ) ( q x , q t 1 , q t 2 ) Γ ( Ω 2 n ) , i S λ ρ ( n ) min i S D ( q x | | q y 1 i ) + γ D ( q t 1 i | | q y 1 i ) ( ( q x , q t 1 , q t 2 ) Γ ( Ω 2 n ) ) .
Therefore by (18), (A10) implies that Ω 2 n is constrained in
Λ ¯ 2 n : = ( x , t 1 , t 2 ) | MGJS ( q t 1 , q x , γ ) λ ρ ( n ) ,
and for any pair of distributions P = ( P 1 , P 2 ) P S ( X ) × P U ( X ) , we obtain
Ω 2 n Λ ¯ 2 n Λ ¯ 1 n Ω 1 n β 2 ( Λ ¯ n | P ) β 2 ( Ω n | P ) .

Appendix D

Proof of Corollary 1.
It follows from (58) that
$$ \exp(-n\lambda) \ge \beta_1(\phi^n | \tilde{P}_1, \tilde{P}_2) \ge \frac{1}{n}\, \beta_1(\Omega^n | \tilde{P}_1, \tilde{P}_2), \qquad (A13) $$
where Lemma 4 with $\kappa = 1/n$ guarantees the existence of such a type-based test $\Omega^n$ satisfying (A13) for $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and (55) for $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ (cf. Remark 3). Then, the type-I error probability of $\Omega^n$ satisfies
$$ \beta_1(\Omega^n | \tilde{P}_1, \tilde{P}_2) \le n \exp(-n\lambda) = \exp\Big(-n\Big(\lambda - \frac{\log n}{n}\Big)\Big) = \exp(-n\lambda'), \qquad (A14) $$
where $\lambda' = \lambda - \frac{\log n}{n}$. From Lemma 5, the type-II error probability of $\Omega^n$ satisfies
$$ \beta_2(\Omega^n | P_1, P_2) \ge \mathrm{P}_2\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda' - \rho(n) \big\}. \qquad (A15) $$
Combining (A15) and (55) for $(P_1, P_2) \in \mathcal{P}_S(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, we have
$$ \beta_2(\phi^n | P_1, P_2) \ge \Big(1 - \frac{1}{n}\Big)\, \mathrm{P}_2\big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) < \lambda' - \rho(n) \big\} = \Big(1 - \frac{1}{n}\Big)\, \mathrm{P}_2\Big\{ \mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) + \rho(n) + \frac{\log n}{n} < \lambda \Big\}, \qquad (A16) $$
establishing (59). □

Appendix E

Proof of Theorem 3.
We divide the proof of Theorem 3 into two parts: the achievability (direct) part and the converse part. First, as preliminaries, we define the following quantity used in the proof: for $(Q_1, Q_2) \in \mathcal{P}(\mathcal{X})^2$,
$$ T(Q_1, Q_2, \gamma) := \gamma\, \mathrm{E}_{Q_1}\Big[ \big| \iota_1(X | Q_1, Q_2, \gamma) - \mathrm{E}_{Q_1}[\iota_1(X | Q_1, Q_2, \gamma)] \big|^3 \Big] + \mathrm{E}_{Q_2}\Big[ \big| \iota_2(X | Q_1, Q_2, \gamma) - \mathrm{E}_{Q_2}[\iota_2(X | Q_1, Q_2, \gamma)] \big|^3 \Big]. \qquad (A17) $$
Next, we give a lemma that is important in the proof of Theorem 3. □
Lemma A1
(The Berry–Esseen theorem [8]). Let $X_k$, $k = 1, \ldots, n$, be independent random variables with
$$ \mu_k = \mathrm{E}[X_k], \qquad (A18) $$
$$ \sigma^2 = \sum_{k=1}^{n} \mathrm{Var}[X_k], \qquad (A19) $$
$$ T = \sum_{k=1}^{n} \mathrm{E}\big[ |X_k - \mu_k|^3 \big], \qquad (A20) $$
$$ Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{t^2}{2}\Big)\, dt. \qquad (A21) $$
Then, for any $-\infty < \lambda < \infty$, it holds that
$$ \Big| \mathrm{P}\Big[ \sum_{k=1}^{n} (X_k - \mu_k) \ge \lambda \sigma \Big] - Q(\lambda) \Big| \le \frac{6T}{\sigma^3}. \qquad (A22) $$

Appendix E.1. Achievability Part

In the achievability proof, we use the following test $\Lambda^n = (\Lambda_1^n, \Lambda_2^n)$:
$$ \Lambda_2^n = \Big\{ (\boldsymbol{x}, \boldsymbol{t}_1, \boldsymbol{t}_2) \,\Big|\, \mathrm{GJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma) \ge \tilde{\lambda} + \frac{r}{\sqrt{n}} \Big\}, \qquad (A23) $$
where $\mathrm{MGJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma)$ in (17) is now $\mathrm{GJS}(q_{\boldsymbol{t}_1}, q_{\boldsymbol{x}}, \gamma)$ because the first source is assumed to be a stationary and memoryless source. Fix any
$$ r < \sup\Big\{ \bar{r} \,\Big|\, \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) < \lambda} w(j) + \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) = \lambda} \Phi_j(\bar{r})\, w(j) \le \epsilon \Big\}. \qquad (A24) $$
Then, for any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, we show
$$ \beta_1(\Lambda^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda - \sqrt{n}\, r), \qquad (A25) $$
$$ \limsup_{n\to\infty} \beta_2(\Lambda^n | P_1, P_2) \le \epsilon. \qquad (A26) $$
The test $\Lambda^n$ defined by (A23) is the same as the test defined by (17) with $\tilde{\lambda}$ replaced by $\tilde{\lambda} + \frac{r}{\sqrt{n}}$ and $s$ replaced by 1, and so (A25) can be derived from the argument in Section 3.3.1. Therefore, we will prove only (A26). We use the set $\mathcal{B}_n(Q)$, $Q \in \mathcal{P}(\mathcal{X})$, defined by (37). An upper bound on the type-II error probability of the test $\Lambda^n$, defined by (A23), for any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ can be evaluated as follows:
lim sup n β 2 ( Λ n | P 1 , P 2 ) = lim sup n P 2 GJS ( q t 1 , q x , γ ) λ + r n + η ( n ) = lim sup n j = 1 u w ( j ) P 2 j GJS ( q t 1 , q x , γ ) λ + r n + η ( n ) lim sup n j = 1 u w ( j ) P 2 j GJS ( q t 1 , q x , γ ) λ + r n + η ( n ) , t 1 B N ( P 1 ) , x B n ( P 2 i ) + lim sup n j = 1 u w ( j ) P 2 j t 1 B N ( P 1 ) or x B n ( P 2 i ) = lim sup n j = 1 u w ( j ) P 2 j { 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) λ + r n
+ O log n n , t 1 B N ( P 1 ) , x B n ( P 2 i ) } lim sup n j = 1 u w ( j ) P 2 j { 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) λ + r n + O log n n } = lim sup n j = 1 u w ( j ) P 2 j { i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) E P 1 [ ι 1 ( X | P 1 , P 2 j , γ ) ] + i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) E P 2 [ ι 2 ( X | P 1 , P 2 j , γ ) ] n ( λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n ) }
where (A27) follows from Lemma 3 and the proof of (A28) is provided in Appendix F.
lim sup n β 2 ( Λ n | P 1 , P 2 ) lim sup n j = 1 u w ( j ) { G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ )
+ 6 T ( P 1 , P 2 j , γ ) n ( V ( P 1 , P 2 j , γ ) ) 3 } = lim sup n j = 1 u w ( j ) G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) j = 1 u w ( j ) lim sup n G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) ,
where (A29) and the last inequality are derived from Lemma A1 and Fatou’s lemma, respectively. Here,
lim sup n G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) = 1 for j U : GJS ( P 1 , P 2 j , γ ) < λ G r V ( P 1 , P 2 j , γ ) for j U : GJS ( P 1 , P 2 j , γ ) = λ 0 for j U : GJS ( P 1 , P 2 j , γ ) > λ .
Therefore,
lim sup n β 2 ( Λ n | P 1 , P 2 ) j U : GJS ( P 1 , P 2 j , γ ) < λ w ( j ) + j U : GJS ( P 1 , P 2 j , γ ) = λ G r V ( P 1 , P 2 j , γ ) w ( j ) = j U : GJS ( P 1 , P 2 j , γ ) < λ w ( j ) + j U : GJS ( P 1 , P 2 j , γ ) = λ Φ ( r ) w ( j ) .
Thus, by (A24), we can see that
lim sup n β 2 ( Λ n | P 1 , P 2 ) ϵ .

Appendix E.2. Converse Part

For any pair of distributions $(P_1, P_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$ and for all pairs of distributions $(\tilde{P}_1, \tilde{P}_2) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}_U(\mathcal{X})$, fix any test $\phi^n$ satisfying
$$ \beta_1(\phi^n | \tilde{P}_1, \tilde{P}_2) \le \exp(-n\lambda - \sqrt{n}\, r), \qquad (A34) $$
$$ \limsup_{n\to\infty} \beta_2(\phi^n | P_1, P_2) \le \epsilon. \qquad (A35) $$
We show
$$ r \le \sup\Big\{ \bar{r} \,\Big|\, \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) < \lambda} w(j) + \sum_{j \in U:\, \mathrm{GJS}(P_1, P_2^j, \gamma) = \lambda} \Phi_j(\bar{r})\, w(j) \le \epsilon \Big\}. \qquad (A36) $$
To prove (A36), we use Corollary 1 with replacing λ with λ + r n . A lower bound on the type-II error probability of the test ϕ n for any pair of distributions ( P 1 , P 2 ) P ( X ) × P U ( X ) can be evaluated as follows:
lim sup n β 2 ( ϕ n | P 1 , P 2 ) lim sup n 1 1 n j = 1 u w ( j ) P 2 j GJS ( q t 1 , q x , γ ) + ρ ( n ) + O log n n < λ + r n lim sup n 1 1 n j = 1 u w ( j ) P 2 j { GJS ( q t 1 , q x , γ ) + ρ ( n ) + O log n n < λ + r n ,
t 1 B N ( P 1 ) , x B n ( P 2 j ) } = lim sup n 1 1 n j = 1 u w ( j ) P 2 j 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) < λ + r n
+ O log n n lim sup n P 2 j t 1 B N ( P 1 ) , x B n ( P 2 j ) lim sup n 1 1 n j = 1 u w ( j ) { G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ )
6 T ( P 1 , P 2 j , γ ) n ( V ( P 1 , P 2 j , γ ) ) 3 } = lim sup n j = 1 u w ( j ) { G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) lim inf n j = 1 u w ( j ) G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) j = 1 u w ( j ) lim inf n G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) ,
where (A37) and (38) follow from Lemma 3 and Lemma A1, respectively. The last inequality is due to Fatou’s lemma. Here,
lim inf n G λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n n V ( P 1 , P 2 j , γ ) = 1 for j U : GJS ( P 1 , P 2 j , γ ) < λ G r V ( P 1 , P 2 j , γ ) for j U : GJS ( P 1 , P 2 j , γ ) = λ 0 for j U : GJS ( P 1 , P 2 j , γ ) > λ .
Therefore,
lim sup n β 2 ( ϕ n | P 1 , P 2 ) j U : GJS ( P 1 , P 2 j , γ ) < λ w ( j ) + j U : GJS ( P 1 , P 2 j , γ ) = λ G r V ( P 1 , P 2 j , γ ) w ( j ) = j U : GJS ( P 1 , P 2 j , γ ) < λ w ( j ) + j U : GJS ( P 1 , P 2 j , γ ) = λ Φ ( r ) w ( j ) .
In view of (A35), Equation (A41) indicates that
r sup r | j U : GJS ( P 1 , P 2 j , γ ) < λ w ( j ) + j U : GJS ( P 1 , P 2 j , γ ) = λ Φ j ( r ) w ( j ) ϵ .

Appendix F

Proof of Equation (A28)).
We shall show
1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) λ + r n + O log n n = 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) E P 1 [ ι 1 ( X | P 1 , P 2 j , γ ) ] + 1 n i = 1 n { ι 2 ( x i | P 1 , P 2 j , γ ) E P 2 [ ι 2 ( X | P 1 , P 2 j , γ ) ] } λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n .
The left-hand side of (A43) can be rewritten as follows:
1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) λ + r n + O log n n 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) γ E P 1 [ ι 1 ( X | P 1 , P 2 j , γ ) ] + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ ) E P 2 [ ι 2 ( X | P 1 , P 2 , γ ) ] λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n 1 n i = 1 N ι 1 ( t 1 , i | P 1 , P 2 j , γ ) 1 n i = 1 N E P 1 [ ι 1 ( X | P 1 , P 2 j , γ ) ] + 1 n i = 1 n ι 2 ( x i | P 1 , P 2 j , γ )
1 n i = 1 n E P 2 [ ι 2 ( X | P 1 , P 2 , γ ) ] λ + r n GJS ( P 1 , P 2 j , γ ) + O log n n ,
where (A44) follows from (47). Therefore, we obtain Equation (A28). □

References

1. Merhav, N.; Ziv, J. A Bayesian approach for classification of Markov sources. IEEE Trans. Inf. Theory 1991, 37, 1067–1071.
2. Saito, S.; Matsushima, T. Evaluation of error probability of classification based on the analysis of the Bayes code. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 21–26.
3. Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 1989, 35, 401–408.
4. Hsu, H.-W.; Wang, I.-H. On binary statistical classification from mismatched empirically observed statistics. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2538–3533.
5. Zhou, L.; Tan, V.Y.F.; Motani, M. Second-order asymptotically optimal statistical classification. Inf. Inference J. IMA 2020, 9, 81–111.
6. Han, T.S.; Nomura, R. First- and second-order hypothesis testing for mixed memoryless sources. Entropy 2018, 20, 174.
7. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
8. Polyanskiy, Y.; Poor, H.V.; Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359.
9. Yagi, H.; Han, T.S.; Nomura, R. First- and second-order coding theorems for mixed memoryless channels with general mixture. IEEE Trans. Inf. Theory 2016, 62, 4395–4412.
10. Ziv, J. On classification with empirically observed statistics and universal data compression. IEEE Trans. Inf. Theory 1988, 34, 278–286.
11. Kelly, B.G.; Wagner, A.B.; Tularak, T.; Viswanath, P. Classification of homogeneous data with large alphabets. IEEE Trans. Inf. Theory 2013, 59, 782–795.
12. Unnikrishnan, J.; Huang, D. Weak convergence analysis of asymptotically optimal hypothesis tests. IEEE Trans. Inf. Theory 2016, 62, 4285–4299.
13. He, H.; Zhou, L.; Tan, V.Y.F. Distributed detection with empirically observed statistics. IEEE Trans. Inf. Theory 2020, 66, 4349–4367.
14. Saito, S.; Matsushima, T. Evaluation of error probability of classification based on the analysis of the Bayes code: Extension and example. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, VIC, Australia, 12–20 July 2021; pp. 1445–1450.
15. Csiszár, I. The method of types. IEEE Trans. Inf. Theory 1998, 44, 2505–2523.
16. Nielsen, F. On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius. Entropy 2021, 23, 464.
Figure 1. System model.
Figure 2. The first-order maximum type-I error exponent $\hat{\lambda}(\epsilon)$ ($0 < \gamma \le 2$).
Figure 3. The first-order maximum type-I error exponent $\hat{\lambda}(\epsilon)$ ($\gamma = 2$).
Figure 4. The second-order maximum type-I error exponent $\hat{r}(\epsilon, \lambda)$.