Analysis on Optimal Error Exponents of Binary Classification for Source with Multiple Subclasses

We consider a binary classification problem for a test sequence to determine from which source the sequence is generated. The system classifies the test sequence based on empirically observed (training) sequences obtained from unknown sources P1 and P2. We analyze the asymptotic fundamental limits of statistical classification for sources with multiple subclasses. We investigate the first- and second-order maximum error exponents under the constraint that the type-I error probability for all pairs of distributions decays exponentially fast and the type-II error probability is upper bounded by a small constant. In this paper, we first give a classifier which achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses, and then provide a characterization of the first-order error exponent. We next provide a characterization of the second-order error exponent in the case where only P2 has multiple subclasses but P1 does not. We generalize our results to classification in the case that P1 and P2 are a stationary and memoryless source and a mixed memoryless source with general mixture, respectively.


Background
The problem of learning sources from training sequences and estimating the source from which a test sequence is generated is known as a classification problem. Recently, this problem has been actively studied in fields such as machine learning, and it is desirable to conduct studies that guarantee the performance of such systems. In the field of information theory, studies have been conducted mainly to analyze the performance of optimal tests. When the number of sources is two, the binary classification problem can be regarded as a binary hypothesis testing problem that uses training sequences. In the setting of binary hypothesis testing, it is assumed that the sources are known; in real-world applications, however, the sources are generally not known. It is therefore important to consider the binary classification problem.
Hypothesis testing includes approaches such as the Bayesian test [1,2] and the Neyman-Pearson test [3][4][5]. In this paper, we take the latter approach to formulate the best asymptotic error exponent (the exponential part of an error probability).
There are a lot of studies related to the classification problem. We state important points, which are deeply connected to this study in some previous studies. Gutman [3] has shown that type-based (empirical distribution-based) tests asymptotically achieve the maximum type-II error exponent for stationary Markov sources, while the type-I error probability exponentially converges to zero as the length of a test sequence goes to infinity. Zhou et al. [5] derived second-order approximations of the maximum type-I error exponent for stationary and memoryless sources when the type-II error probability is upper bounded by a small constant. On the other hand, for the hypothesis testing problem, Han and Nomura [6] characterized a first-order maximum error exponent when each sequence is generated from a mixed memoryless source, which is a mixture of stationary and memoryless sources. In addition, they also characterized a second-order maximum error exponent in the case where one source is a stationary and memoryless source and the other source is a mixed memoryless source.

Contributions
In this paper, we investigate the binary classification problem for stationary memoryless sources with multiple subclasses. The class of sources with multiple subclasses is important in binary classification because such settings are common in real-world applications. For example, newspaper articles with science headlines cover topics such as physics, chemistry, and biology. We assume that the sources (subclasses) are characterized by a mixture with some unknown prior distribution (cf. Equation (2)), so that the overall sources can be regarded as mixed memoryless sources [6]. The purpose of this paper is to characterize the first- and second-order maximum error exponents in a single-letter form (the term "single-letter form" means an expression that does not depend on the sequence lengths n or N; cf. the formulas for the error exponents in Theorems 2-4).
To this end, we generalize Gutman's classifier [3], which was shown to be first- and second-order optimal for memoryless sources (with no multiple subclasses) in [5]. This classifier uses training sequences from one of the two sources, as in [3][4][5], making a type-based decision for a source (subclass) with the smallest skewed Jensen-Shannon divergence [7] among the subclasses. We show in Theorem 1 that this classifier asymptotically achieves the maximum type-II error exponent in the class of deterministic classifiers for a given pair of distributions when the type-I error probability decays exponentially fast for all pairs of distributions. We also demonstrate in Theorem 2 that the structure of this classifier leads to a reversed and more relaxed relation: the maximization of the type-I error exponent when the type-II error probability is upper bounded by a small constant ε (0 ≤ ε < 1) for sources with multiple subclasses. In addition, using the Berry-Esseen theorem [8], we derive in Theorem 3 the second-order maximum error exponent in the case where only one of the sources has subclasses. Finally, the fact that the classifier uses the training sequence from only one of the two sources motivates us to consider a more general case: the first source is a source with no multiple subclasses, while the second source is given by a general mixture [6,9]. That is, the number of subclasses is not necessarily finite, and the prior distribution of subclasses may not be discrete (cf. Equation (75)). We give characterizations of the first- and second-order maximum error exponents in Theorem 4.

Related Work
Ziv [10] proposed a classifier based on empirical entropy and discussed the relationship between binary classification and universal source coding. Hsu and Wang [4] characterized the maximum error exponent with mismatched empirically observed statistics. In their achievability proof, a generalization of Gutman's classifier is also used. Kelly et al. [11] investigated binary classification with large alphabets. Unnikrishnan and Huang [12] investigated the type-I error probability of binary classification using the analysis of weak convergence. Generalizing the binary classification problem, He et al. [13] discussed the binary distribution detection problem, in which a different generalization of Gutman's classifier is also discussed.
There are also studies which take a Bayesian approach. Merhav and Ziv [1] analyzed the weighted sum of type-I and type-II error probabilities, and subsequently, Saito and Matsushima [2,14] gave a different result via the analysis for the Bayes code.

Organization
The rest of this paper is organized as follows. In Section 2, we define the notation used in this paper and describe the details of the source and system model. Moreover, we state the problem setting, defining the first- and second-order maximum error exponents.
In Section 3, we first give a classifier that achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses. Next, we characterize the first- and second-order maximum error exponents and provide the detailed proofs for the first-order representation. In Section 4, we generalize the obtained results to classification with a mixed memoryless source with general mixture. In Section 5, we present numerical examples. Finally, in Section 6, we provide some concluding remarks and future work.

Notation
The set of non-negative real numbers is denoted by R_+. Calligraphic X stands for a finite alphabet. Upper-case X denotes a random variable taking values in X, and lower-case x ∈ X denotes its realization. Throughout this paper, logarithms are of base e. For integers a and b such that a ≤ b, [a, b] denotes the set {a, a+1, ..., b}. The set of all probability distributions on a finite set X is denoted as P(X). Notation regarding the method of types [15] is as follows: given a vector x = (x_1, x_2, ..., x_n) ∈ X^n, the type (empirical distribution) q_x ∈ P(X) is defined by q_x(a) = (1/n)|{m ∈ [1, n] : x_m = a}| for each a ∈ X. The set of types formed from length-n sequences with alphabet X is denoted as P_n(X). The probability that n independent drawings from a probability distribution Q ∈ P(X) give x ∈ X^n is denoted by Q(x).
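As a concrete illustration of the method of types, the empirical distribution q_x of a short sequence can be computed as follows (a minimal sketch; the function name `empirical_type` is ours):

```python
from collections import Counter

def empirical_type(x, alphabet):
    """Return the type (empirical distribution) q_x of a sequence x:
    q_x(a) = |{m : x_m = a}| / n for each symbol a in the alphabet."""
    n = len(x)
    counts = Counter(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

# the sequence (0, 1, 1, 0, 1) has type q_x(0) = 2/5, q_x(1) = 3/5
q = empirical_type([0, 1, 1, 0, 1], alphabet=[0, 1])
```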

Source with Multiple Subclasses
Consider a source consisting of multiple subclasses, each of which is drawn according to a given probability (weight). Let {P_1^i}_{i∈S} be a family of probability distributions on a finite alphabet X, where S = {1, ..., s} is a probability space with probability measure v(i), i ∈ S. That is, the probability of x ∈ X^n is given by

P_1(x) = Σ_{i∈S} v(i) P_1^i(x),   (2)

where the i-th subclass P_1^i is a stationary and memoryless source. That is, for x = (x_1, x_2, ..., x_n) ∈ X^n,

P_1^i(x) = ∏_{m=1}^n P_1^i(x_m)

(for notational simplicity, we denote both the multi-letter and single-letter probabilities by P_1^i with a slight abuse of notation). In view of (2), the sequence x can be regarded as an output from a mixed memoryless source P_1(·), and it is called a test sequence. Similarly, let {P_2^i}_{i∈U} be a family of probability distributions, where U = {1, ..., u} is a probability space with probability measure w(i), i ∈ U. For these mixed sources, if the sources are known, that is, if the addressed problem is hypothesis testing, the first- and second-order error exponents were analyzed by Han and Nomura [6]. In this paper, we assume that the sources are unknown and training sequences are available to learn about the sources. Sets of training sequences are denoted by t_1 = {t_1^1, ..., t_1^s} and t_2 = {t_2^1, ..., t_2^u}, where t_i^j ∈ X^N of length N is output from subclass j and N = nγ for some fixed γ ∈ R_+. The joint probabilities of the training sequences are then

P_1(t_1) = ∏_{j∈S} P_1^j(t_1^j)  and  P_2(t_2) = ∏_{j∈U} P_2^j(t_2^j),

respectively. We define the class of sources with multiple subclasses on a probability space S with probability measure v(·) as the set of mixtures of the form (2), which means P_1 ∈ P_S(X), where the set of weights {v(i)}_{i∈S} is implicitly fixed. Similarly, we define the class of sources with multiple subclasses on a probability space U with probability measure w(·) as the set of mixtures of {P_2^i}_{i∈U} with weights {w(i)}_{i∈U}, which means P_2 ∈ P_U(X).
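The two-stage generation in (2) — draw a subclass index i with probability v(i), then emit n i.i.d. symbols from P_1^i — can be sketched as follows (a minimal illustration; the function name and the example distributions are ours):

```python
import random

def sample_mixed_source(subclasses, weights, n, seed=0):
    """Draw x in X^n from a source with multiple subclasses:
    pick subclass i ~ v, then emit n i.i.d. symbols from P_1^i."""
    rng = random.Random(seed)
    i = rng.choices(range(len(subclasses)), weights=weights)[0]
    dist = subclasses[i]
    symbols = list(dist)
    probs = [dist[a] for a in symbols]
    return [rng.choices(symbols, weights=probs)[0] for _ in range(n)]

# two Bernoulli subclasses with weights v = (0.3, 0.7)
x = sample_mixed_source([{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}], [0.3, 0.7], n=10)
```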

System Model
The binary classification problem assumed in this paper is shown in Figure 1. It consists of two phases: (I) the learning phase and (II) the classification phase. We explain the details of each phase. (I) Learning phase: determine the classifier by learning from the training sequences t_1 = {t_1^1, ..., t_1^s} and t_2 = {t_2^1, ..., t_2^u} generated from the unknown sources P_1 ∈ P_S(X) and P_2 ∈ P_U(X), respectively.
(II) Classification phase: judge whether the test sequence x ∈ X^n is generated from P_1 ∈ P_S(X) or P_2 ∈ P_U(X), according to the classifier determined in (I).

Maximum Error Exponent
In this section, we define two error probabilities that arise in a binary classification problem and formulate the maximum error exponents. In the binary classification problem, a test is described as a partition of the sample space, i.e., a map φ_n : X^n × X^{(s+u)N} → {1, 2}. The type-I and type-II error probabilities of a given test φ_n are denoted as β_1(φ_n|P_1, P_2) and β_2(φ_n|P_1, P_2), respectively. That is,

β_1(φ_n|P_1, P_2) := P_1{φ_n(x, t_1, t_2) = 2},   β_2(φ_n|P_1, P_2) := P_2{φ_n(x, t_1, t_2) = 1}.

Here, P_θ{·} is the joint probability of training and test sequences when the underlying parameter is θ ∈ {1, 2}, i.e., the product measure under which x is generated from P_θ and the training sequences t_1 and t_2 are generated from P_1 and P_2, respectively. We consider the problem of maximizing the type-I error exponent when the type-II error probability is upper bounded by a small constant ε ∈ [0, 1). In this study, we characterize the following quantities (the first- and second-order maximum type-I error exponents).
Definition 2 (Second-order maximum error exponent). For any pair of distributions P = (P_1, P_2) ∈ P_S(X) × P_U(X) and ε ∈ [0, 1), we define the second-order maximum error exponent r̂(ε, λ), where the weights of P̃ = (P̃_1, P̃_2) ∈ P_S(X) × P_U(X) are the same as the weights of P = (P_1, P_2) ∈ P_S(X) × P_U(X). Remark 1. In (12) and (13), the type-I error probability is constrained for any P̃ ∈ P_S(X) × P_U(X) for technical reasons. More precisely, this condition is required in the proof of the converse part. The same condition was also imposed by Gutman [3], Hsu and Wang [4], and Zhou et al. [5].
In Definitions 1 and 2, we focus on universal tests that perform well for all pairs of distributions with respect to the type-I error probability and, at the same time, constrain the type-II error probability with respect to a particular pair of distributions. We obtain the same result when the weights of (P̃_1, P̃_2) ∈ P_S(X) × P_U(X) are not fixed.

A Test to Achieve Maximum Error Exponent
The rule for estimating whether a test sequence generated from P 1 ∈ P S (X ) or P 2 ∈ P U (X ) is called a decision rule. One of the goals of the classification problem is to design an optimal decision rule which achieves a maximum error exponent based on training sequences. In this section, we present a decision rule that asymptotically achieves the maximum type-II error exponent for any pair of distributions when the type-I error exponent is lower bounded by a constant for all pairs of distributions (cf. Theorem 1).
To define a test that is asymptotically optimal, we define two generalizations of the Jensen-Shannon divergence. These generalizations are related to some variational definitions in [7,16]. For any pair of distributions (Q_1, Q_2) ∈ P(X)^2 and any number γ ∈ R_+, let the generalized Jensen-Shannon divergence be

GJS(Q_1, Q_2, γ) := γ D(Q_1 ‖ (γQ_1 + Q_2)/(1+γ)) + D(Q_2 ‖ (γQ_1 + Q_2)/(1+γ)),   (14)

where D(p‖q) denotes the Kullback-Leibler divergence for p ∈ P(X) and q ∈ P(X), defined as

D(p‖q) := Σ_{x∈X} p(x) log (p(x)/q(x)).

The generalized Jensen-Shannon divergence GJS(Q_1, Q_2, γ) corresponds to a skewed α-Jensen-Shannon divergence for α = γ/(1+γ). Additionally, for (Q_1, Q_2) ∈ P_S(X) × P(X), we define the minimized generalized Jensen-Shannon divergence by

MGJS(Q_1, Q_2, γ) := min_{i∈S} GJS(Q_1^i, Q_2, γ).

Given a threshold λ ∈ R_+ (including λ = 0), the decision rule to achieve the maximum error exponent is given by

Λ_n(x, t_1, t_2) := 1 if MGJS(q_{t_1}, q_x, γ) ≤ λ, and 2 otherwise,   (17)

where Λ_n^i is the set of (x, t_1, t_2) determined to be class i by the test Λ_n. By definition, the discriminant function MGJS(q_{t_1}, q_x, γ), appearing in (17), can also be expressed as

MGJS(q_{t_1}, q_x, γ) = min_{i∈S} [γ D(q_{t_1^i} ‖ q_{y_1^i}) + D(q_x ‖ q_{y_1^i})],   (18)

where y_1^i := x t_1^i denotes the concatenation of x and t_1^i. From (17) and (18), Λ_n is a type-based test and implicitly depends on λ. In addition, this test uses the training sequences asymmetrically; only the sequence t_1 is used, but not t_2 (cf. refs. [3][4][5]).

Theorem 1. For any given λ ∈ R_+ and any sequence of tests {φ_n} such that β_1(φ_n|P̃) ≤ exp(−nλ) for all P̃ = (P̃_1, P̃_2) ∈ P_S(X) × P_U(X), the sequence of tests {Λ_n} given by (17) satisfies (19) and (20) for any pair of distributions P = (P_1, P_2) ∈ P_S(X) × P_U(X).

Proof. Equation (19) is derived in Section 3.3.1. The proof of (20) follows from Corollary 1.
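The divergences above and the resulting type-based decision rule can be sketched numerically as follows. We use the standard form GJS(Q_1, Q_2, γ) = γD(Q_1‖Q_m) + D(Q_2‖Q_m) with Q_m = (γQ_1 + Q_2)/(1+γ); the function names are ours, and distributions are represented as dicts over a finite alphabet:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in nats over a finite alphabet."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def gjs(q1, q2, gamma):
    """Generalized Jensen-Shannon divergence GJS(Q1, Q2, gamma)."""
    qm = {a: (gamma * q1[a] + q2[a]) / (1 + gamma) for a in q1}
    return gamma * kl(q1, qm) + kl(q2, qm)

def mgjs(subclass_types, qx, gamma):
    """Minimized GJS over the types of the subclass training sequences."""
    return min(gjs(qt, qx, gamma) for qt in subclass_types)

def classify(subclass_types, qx, gamma, lam):
    """Type-based test: decide class 1 iff MGJS(q_t1, q_x, gamma) <= lambda."""
    return 1 if mgjs(subclass_types, qx, gamma) <= lam else 2
```

Note that GJS(Q, Q, γ) = 0 for any Q, so a test sequence whose type matches some subclass training type is always assigned to class 1, consistent with a threshold λ ≥ 0.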
Although there is a deviation between the exponents in Corollary 1 and for the test Λ n , the deviation vanishes asymptotically.
Theorem 1 shows that the test Λ_n can asymptotically achieve the maximum type-II error exponent among the tests φ_n for which the type-I error exponent is greater than or equal to λ. This test also has a reversed and more relaxed property: it achieves λ̂(ε), the maximum type-I error exponent when the type-II error probability is upper bounded by a constant ε (0 ≤ ε < 1) (see the achievability proof of Theorem 2 in Section 3.3.1).

First-Order Maximum Error Exponent
In this section, we characterize the first-order maximum error exponent in a single-letter form for sources with multiple subclasses.
Proof. The proof is provided in Section 3.3.

Remark 2.
If S and U are singletons (that is, s = 1 and u = 1), Theorem 2 reduces to the following formula given by Zhou et al. [5], which means that λ̂(ε) does not depend on ε and that the strong converse holds in this case, unlike in the case |S|, |U| > 1. On the other hand, for general S and U but in the special case of ε = 0, formula (21) reduces to a simpler expression for λ̂(0).

Proof of Theorem 2
We divide the proof of Theorem 2 into two parts: the achievability (direct) part and the converse part.

Achievability Part
In the achievability proof, we use the type-based test Λ_n given by (17). Fix any λ satisfying (24). Then, for any pair of distributions (P_1, P_2) ∈ P_S(X) × P_U(X) and for all pairs of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X), we show (25) and (26). First, we prove (25). As preliminaries, we define the sets used in the proof, where Λ̃_n^2 is the projection of Λ_n^2 ⊆ X^n × X^{sN} × X^{uN} onto the space X^n × X^{sN}. To evaluate the probability of a source sequence being in T_n(q_x), the following relationship holds from the method of types [15].

Lemma 1. Suppose that the sequence x is sampled independently from the source P ∈ P(X). Then, for any type q_x ∈ P_n(X),

(n+1)^{−|X|} exp(−n D(q_x‖P)) ≤ P{x ∈ T_n(q_x)} ≤ exp(−n D(q_x‖P)),

where P_n(X) denotes the set of types formed from length-n sequences with alphabet X and |P_n(X)| ≤ (n+1)^{|X|}.
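The polynomial bound |P_n(X)| ≤ (n+1)^{|X|} used alongside Lemma 1 can be checked directly by enumerating all types of short sequences (a brute-force sketch; the function name is ours):

```python
from itertools import product

def num_types(n, alphabet_size):
    """Count the distinct types of length-n sequences over an alphabet of
    the given size by enumerating all sequences and collecting their
    symbol-count vectors."""
    types = set()
    for x in product(range(alphabet_size), repeat=n):
        types.add(tuple(x.count(a) for a in range(alphabet_size)))
    return len(types)

# the number of types grows only polynomially in n
for n in range(1, 6):
    assert num_types(n, 2) <= (n + 1) ** 2
```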
Then, an upper bound on the type-II error probability of the test Λ_n = (Λ_n^1, Λ_n^2) for all pairs of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X) can be evaluated as follows, where (31) is derived from (4), (32) follows from Lemma 1, and (33) is derived in Appendix A.
Next, we demonstrate (26). As preliminaries, we define a new typical set and show some properties used in the proof. For any given Q ∈ P(X), define the typical set B_n(Q) as in (37). By using [6, Lemma 22], for t_1 ∼ Q_1 and x ∼ Q_2 generated from memoryless sources (Q_1, Q_2) ∈ P(X)^2, we have (38). For any given x ∈ X and any pair of distributions (Q_1, Q_2) ∈ P(X)^2, we define the information density. Furthermore, for any pair of distributions (Q_1, Q_2) ∈ P_S(X) × P(X), we define the function i*(Q_1, Q_2), the index of the minimizing subclass of Q_1, as in (40). Hereafter, we denote i*(Q_1, Q_2) simply as i*(Q_2) when the first argument is clear from the context.
Proof. The proof is provided in Appendix B.
Lemma 3 (Zhou et al. [5]). Assume that (Q_1, Q_2) ∈ P(X)^2, where t_{1,m}^i denotes the m-th symbol of the sequence t_1^i and x_m denotes the m-th symbol of the sequence x. Note that the probability P_2 is calculated by assuming that the test sequence x is generated from P_2 (cf. Equation (11)). An upper bound on the type-II error probability can be evaluated as follows, where Equation (43) follows from Equations (14), (38), and (40). By using Lemma 3, Equation (43) can be expanded further, where Equation (44) is derived from Fatou's lemma. Here, Equation (14) can also be expressed in the form used in (38), and therefore, by the weak law of large numbers, for any given δ > 0, the corresponding probabilities concentrate. From this result of the weak law of large numbers and Lemma 2, combining (45)-(48) yields an upper bound on lim sup_{n→∞} β_2(Λ_n|P_1, P_2). Thus, by (24), we can see that lim sup_{n→∞} β_2(Λ_n|P_1, P_2) ≤ ε.

Converse Part
For any pair of distributions (P_1, P_2) ∈ P_S(X) × P_U(X) and for all pairs of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X), fix any test φ_n satisfying the type-I error constraint. We shall show the corresponding lower bound on the type-II error probability. We first give some lemmas, which are useful in the proof of the converse part.

Lemma 4.
Let φ_n be a test in which the decision rule depends only on (x, t_1, t_2) ∈ X^n × X^{sN} × X^{uN}. Then, for any given κ ∈ [0, 1], we can construct a type-based test Ω_n satisfying (54) and (55) for any pair of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X).

Remark 3.
As in the proofs of Lemma 2 of [3] and Lemma 7 of [5], the type-based test Ω_n specified in Lemma 4 is obtained by tailoring φ_n and satisfies Equations (54) and (55) for all (P̃_1, P̃_2) ∈ P_S(X) × P_U(X). In other words, the construction of Ω_n is universal, in the same spirit as Lemma 2 of [3]. This claim is slightly stronger than the one in Lemma 7 of [5].

Lemma 5.
For any λ ∈ R_+ and any type-based test Ω_n satisfying β_1(Ω_n|P̃_1, P̃_2) ≤ exp(−nλ) for all pairs of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X), we have, for any pair of distributions (P_1, P_2) ∈ P_S(X) × P_U(X), a lower bound on β_2(Ω_n|P_1, P_2), where ρ(n) := (|X| log(n+1) + (s+u)|X| log(N+1) − log v*)/n. Proof. The proof is provided in Appendix C.
The type-based test Ω_n specified in Lemma 4 satisfies Equations (54) and (55) for all (P̃_1, P̃_2) ∈ P_S(X) × P_U(X). Setting κ = 1/n in Lemma 4 and combining it with Lemma 5, we can derive the following relation. Corollary 1. For any given λ ∈ R_+ and any test φ_n satisfying β_1(φ_n|P̃_1, P̃_2) ≤ exp(−nλ) for all pairs of distributions (P̃_1, P̃_2) ∈ P_S(X) × P_U(X), we have (59) for any pair of distributions (P_1, P_2) ∈ P_S(X) × P_U(X). Proof. The proof is provided in Appendix D.
By (59), a lower bound on the type-II error probability can be evaluated as in Equations (60)-(63).

Second-Order Maximum Error Exponent
In this section, we characterize the second-order maximum error exponent. For simplicity, we assume that only P_2 has subclasses, but P_1 does not (s = 1). First, from Theorem 2 with s = 1, the first-order maximum error exponent in this setting is characterized as follows: for any pair of distributions (P_1, P_2) ∈ P(X) × P_U(X), we have (64), where MGJS(P_1, P_2^j, γ) in (21) is replaced by GJS(P_1, P_2^j, γ). Next, we provide a characterization of the second-order maximum error exponent in Definition 2 with s = 1 in the case where only P_2 has subclasses. By definition, r̂(ε, λ) = +∞ if λ < λ̂(ε) and r̂(ε, λ) = −∞ if λ > λ̂(ε). Therefore, in the discussion of the second-order error exponent, we focus on the case λ = λ̂(ε). Theorem 3. For any pair of distributions (P_1, P_2) ∈ P(X) × P_U(X) and ε ∈ [0, 1), r̂(ε, λ) is characterized by (65), where Φ denotes the cumulative distribution function of the standard Gaussian distribution and, for any pair of distributions (Q_1, Q_2) ∈ P(X)^2, Var_Q[·] represents the variance with respect to Q ∈ P(X).
Proof. The proof is provided in Appendix E.

Remark 5.
We can summarize the two terms on the right-hand side of (65) into a single term, called the canonical equation [6]. We focus on the case λ = λ̂(ε); from Theorem 2, λ̂(ε) is then characterized by (64).

Generalization to Mixed Memoryless Sources with General Mixture
In this section, we consider the classification problem in the case where P_1 does not have subclasses and P_2 is given by a general mixture model. The general mixture model considered here extends the source with multiple subclasses defined in Section 2.2. Since the decision rule that achieves the maximum error exponent can be operated using only one of the training sequences, we assume in this section that only the training sequence t_1 is available. We then provide a characterization of the maximum error exponents in a single-letter form under this setting. First, we define the source referred to as a mixed memoryless source with general mixture [6,9] as follows. Let Θ be an arbitrary probability space with a general probability measure w(θ), θ ∈ Θ. Then, the probability of x ∈ X^n is given by

P_2(x) = ∫_Θ P_2^θ(x) dw(θ),   (75)

where P_2^θ is a stationary and memoryless source for each θ ∈ Θ; that is, for x = (x_1, ..., x_n) ∈ X^n, P_2^θ(x) = ∏_{m=1}^n P_2^θ(x_m). When a test sequence is output from P_2, the probability distribution of the sequence takes the form of (75). Here, the type-I and type-II error probabilities of a test φ_n = (φ_n^1, φ_n^2) are defined analogously. Proof. We can prove this theorem in the same way as Theorems 2 and 3.
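In contrast to the finite mixture (2), the prior in (75) may be continuous. This can be sketched as follows (the Beta prior over a Bernoulli parameter is our illustrative choice, not taken from the paper):

```python
import random

def sample_general_mixture(n, seed=1):
    """Draw x in {0,1}^n from a mixed memoryless source with general mixture:
    theta ~ Beta(2, 5) (a continuous prior, chosen for illustration), then
    x_1, ..., x_n i.i.d. Bernoulli(theta)."""
    rng = random.Random(seed)
    theta = rng.betavariate(2, 5)
    x = [1 if rng.random() < theta else 0 for _ in range(n)]
    return x, theta

x, theta = sample_general_mixture(20)
```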

First-Order Maximum Error Exponent
In this section, we present a numerical example to illustrate the first-order maximum type-I error exponent λ̂(ε) characterized in Theorem 2.
A numerical example of the first-order maximum error exponent is given by calculating the right-hand side of (64) for the following settings. We assume that X = {0, 1}. We fix the set of probabilities and weights as follows, where Bern(·) denotes the Bernoulli distribution. The relation among λ̂(ε), ε, and γ is shown in Figure 2. Additionally, for γ = 2, the behavior of λ̂(ε) is depicted in Figure 3. As ε becomes larger, the value of λ̂(ε) also increases like a step function. The steps occur at ε = 1/6 and ε = 1/2. We can also confirm that λ̂(ε) is right-continuous in ε. This is because the limit superior of the type-II error probability is constrained in Definition 1.
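The step behavior described above can be reproduced with a short sketch. Under our reading of Theorem 2 (an assumption for illustration, not a verbatim statement of (21)/(64)), λ̂(ε) is the largest threshold λ such that the total weight of subclasses j with GJS(P_1, P_2^j, γ) below λ does not exceed ε:

```python
def lambda_hat(gjs_values, weights, eps):
    """Illustrative first-order exponent: sort subclasses by their GJS value
    and spend the type-II error budget eps on the smallest ones; the answer
    is the first GJS value that no longer fits in the budget."""
    cum = 0.0
    for g, w in sorted(zip(gjs_values, weights)):
        if cum + w <= eps:
            cum += w      # this subclass may be misclassified within eps
        else:
            return g      # largest attainable threshold
    return float('inf')   # unreachable for eps < 1 when weights sum to 1
```

With weights (1/6, 1/3, 1/2), this function is a right-continuous step function of ε with jumps at ε = 1/6 and ε = 1/2, matching the qualitative behavior in Figure 3.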

Second-Order Maximum Error Exponent
As in the previous subsection, we present a numerical example to illustrate the second-order maximum type-I error exponent r̂(ε, λ) characterized in Theorem 3.
A numerical example of the second-order maximum error exponent is given by calculating the right-hand side of (65) for the following settings. We assume that γ = 2 and X = {0, 1}. We fix λ = λ̂(ε) and the set of probabilities and weights as follows, where P_2 is the same as in the previous subsection. The behavior of r̂(ε, λ) is shown in Figure 4. The value of r̂(ε, λ) follows the inverse of the cumulative distribution function of the standard Gaussian on each interval of ε such that 0 ≤ ε < 1/6, 1/6 ≤ ε < 1/2, and 1/2 ≤ ε < 1. In contrast to the first-order λ̂(ε), r̂(ε, λ) is no longer right-continuous in ε.

Conclusions
For binary classification of sources with multiple subclasses, we characterized the first- and second-order maximum error exponents. First, we revealed the first-order maximum error exponent in the case where both P_1 and P_2 are sources with multiple subclasses. To derive this representation, we gave a classifier that achieves the asymptotically maximum error exponent in the class of deterministic classifiers for sources with multiple subclasses.
Next, we showed the second-order maximum error exponent in the case where only one of the sources has subclasses. The key technique for deriving the second-order maximum error exponent is to apply the Berry-Esseen theorem [8] instead of the weak law of large numbers. One may wonder whether we can also derive the second-order approximation in the case where P_1 is also a source with multiple subclasses. To this end, we need to evaluate Lemma 2 more rigorously. This is left for future work.
In addition, for binary classification using only a training sequence generated from P 1 in the case where P 1 does not have subclasses and P 2 is given by a general mixture model, we generalized the analysis for the first-and second-order error exponents. From these results, we revealed the asymptotic performance limits of statistical classification for sources with multiple subclasses.
In this paper, we considered a binary classification problem, but in practice, multiclass classification is of importance. In the case where each class is a memoryless source (without multiple subclasses), the first-and second-order maximum error exponents were analyzed in [5]. Extending the obtained results to multiclass classification for sources with multiple subclasses is also a subject of future studies.

Conflicts of Interest:
The authors declare no conflict of interest.

Proof of Equation (33). We shall show (A1).
The left-hand side of (A1) is expanded as follows: Therefore, we obtain Equation (33).

Appendix B
Proof of Lemma 2. Applying the Taylor expansion to GJS(q_{t_1^i}, q_x, γ) and using (14), for any q_x ∈ B_n(Q_2) and q_{t_1^i}, the stated approximation holds, and the convergence is uniform in i ∈ S. Since, for any (Q_1, Q_2) ∈ P_S(X) × P(X), i*(Q_1, Q_2) was given as a minimizer of the generalized Jensen-Shannon divergence, we obtain the claim of Lemma 2.

Appendix C

Proof of Lemma 5. It follows from (56) that (A7) holds. Using (30) on the right-hand side of (A7), we obtain (A8), where ρ(n) = (|X| log(n+1) + (s+u)|X| log(N+1) − log v*)/n. Taking the negative logarithm of both sides of (A8) and dividing by n, we have (A9). Since (P̃_1^i, P̃_1^r, P̃_2^j) ∈ P(X)^3 is arbitrary in (A9), we can set the i-th subclass as P̃_1^i = q_{x t_1^i} = q_{y_1^i} and the others as P̃_1^r = q_{t_1^r}, r ≠ i. Furthermore, we set P̃_2^j accordingly. Then we obtain (A10). Therefore, by (18), (A10) implies that Ω_n^2 is constrained, and for any pair of distributions P = (P_1, P_2) ∈ P_S(X) × P_U(X), we obtain the bound of Lemma 5.

Proof of Theorem 3.
We divide the proof of Theorem 3 into two parts: the achievability (direct) part and the converse part. First, as preliminaries, we define the following quantity used in the proof for (Q_1, Q_2) ∈ P(X)^2. Next, we give a lemma that is important in the proof of Theorem 3.
The test Λ_n defined by (A23) is the same as the test defined by (17) with λ replaced by λ + r/√n and s set to 1, and so (A25) can be derived from the argument in Section 3.3.1. Therefore, we prove only (A26). We use the set B_n(Q), Q ∈ P(X), defined by (37). An upper bound on the type-II error probability of the test Λ_n defined by (A23), for any pair of distributions (P_1, P_2) ∈ P(X) × P_U(X), can be evaluated as follows:

lim sup_{n→∞} β_2(Λ_n|P_1, P_2) = lim sup_{n→∞} P_2{GJS(q_{t_1}, q_x, γ) ≤ λ + r/√n + η(n)}.

For any pair of distributions (P_1, P_2) ∈ P(X) × P_U(X) and for all pairs of distributions (P̃_1, P̃_2) ∈ P(X) × P_U(X), fix any test φ_n satisfying the type-I error constraint. To prove (A36), we use Corollary 1 with λ replaced by λ + r/√n. A lower bound on the type-II error probability of the test φ_n for any pair of distributions (P_1, P_2) ∈ P(X) × P_U(X) can be evaluated as follows: lim sup_{n→∞} β_2(φ_n|P_1, P_2) ≥ lim sup