Confidence Sets for Statistical Classification (II): Exact Confidence Sets

Abstract: Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. Liu et al. (2019) proposed a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. However, the classifier uses a conservative critical constant. In this paper, we show how to determine the exact critical constant in applications where prior knowledge about the proportions of the future objects from each class is available. As the exact critical constant is smaller than the conservative critical constant given by Liu et al. (2019), the classifier using the exact critical constant is better than the classifier by Liu et al. (2019) as expected. An example is provided to illustrate the method.


Introduction
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. For overviews, the reader is referred to the books [1-5]. In a recent paper, Liu et al. (2019) [6] proposed a new classifier based on confidence sets. It constructs a confidence set for the unknown parameter c, the true class of each future object, and classifies the object as belonging to the set of classes given by the confidence set. Hence, this approach classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects, with a pre-specified confidence γ with respect to the randomness in the training data from which the classifier is constructed.
However, the classifier of Liu et al. (2019) uses a conservative critical constant λ and so the resultant confidence sets may be larger than necessary. The purpose of this paper is to determine the exact critical constant λ and therefore to improve the classifier of Liu et al. (2019) in situations where one has prior knowledge about the proportions of the (infinite) future objects belonging to the k possible classes.
The layout of the paper is as follows. Section 2 gives a very brief review of the classifier of Liu et al. (2019), and then considers the determination of the exact critical constant λ under the additional knowledge/assumption given above. An illustrative example is given in Section 3 to demonstrate the advantage of the improved classifier proposed in this paper when the additional assumption holds. Section 4 contains conclusions and discussions. Finally, some mathematical details are provided in Appendix A. As the same setting and notation as in the work by Liu et al. (2019) are used, it is recommended to read this paper in conjunction with the one by Liu et al. (2019).

Methodology
Let the p-dimensional data vector $x_l = (x_{l1}, \ldots, x_{lp})^{\top}$ denote the feature measurement on an object from the $l$th class, which has the multivariate normal distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$; here, k denotes the total number of classes, which is a known number. The available training dataset is given by $T = \{x_{l1}, \ldots, x_{ln_l};\ l = 1, \ldots, k\}$, where $x_{l1}, \ldots, x_{ln_l}$ are i.i.d. observations from the $l$th class with distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The classification problem is to make inference about c, the true class of a future object, based on the feature measurement $y = (y_1, \ldots, y_p)^{\top}$ observed on the object, which is only known to belong to one of the k classes and so follows one of the k multivariate normal distributions. In statistical terminology, c is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \ldots, k\}$. We emphasize that c is treated as non-random in both the work of Liu et al. (2019) and here.
A classifier that classifies an object with measurement y into one single class in $C = \{1, \ldots, k\}$ can be regarded as a point estimator of c. The classifier of Liu et al. (2019) provides a set $C_T(y) \subseteq C$ as plausible values of c. Depending on y and the training dataset T, $C_T(y)$ may contain only a single value, in which case y is classified into the single class given by $C_T(y)$. When $C_T(y)$ contains more than one value in C, y is classified as possibly belonging to the several classes given by $C_T(y)$. Hence, in statistical terms, the classifier uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed 1 − α proportion of confidence sets that contain the true classes.
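Specifically, the set $C_T(y) \subseteq C$ was constructed by Liu et al. (2019) as
$$C_T(y) = \left\{ l \in C : (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda \right\}, \tag{1}$$
where $\hat{\mu}_l = \frac{1}{n_l} \sum_{m=1}^{n_l} x_{lm}$ and $\hat{\Sigma}_l = \frac{1}{n_l - 1} \sum_{m=1}^{n_l} (x_{lm} - \hat{\mu}_l)(x_{lm} - \hat{\mu}_l)^{\top}$, $l = 1, \ldots, k$, are, respectively, the usual estimators of the unknown $\mu_l$ and $\Sigma_l$ based on the training dataset $T = \{x_{l1}, \ldots, x_{ln_l};\ l = 1, \ldots, k\}$, and λ is a suitably chosen critical constant whose determination is considered next. The intuition behind the definition of $C_T(y)$ in Equation (1) is that a future object y is judged likely to be from class l if and only if $(y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda$.
Note that the proportion of the future confidence sets $C_T(y_j)$ ($j = 1, 2, \ldots$) that include the true classes $c_j$ of $y_j$ ($j = 1, 2, \ldots$) is given by $\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}}$, and the constraint placed on this proportion is
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha, \tag{2}$$
where 1 − α is a pre-specified large (close to 1) proportion, e.g., 0.95. The constraint in Equation (2) is addressed by Liu et al. (2019) through the inequality in Equation (3), in which $E_{y_j \mid T}$ denotes the conditional expectation with respect to the random variable $y_j$ conditioning on the training dataset T (or, equivalently, $\{(\hat{\mu}_1, \hat{\Sigma}_1), \ldots, (\hat{\mu}_k, \hat{\Sigma}_k)\}$).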
Since the value of the expression on the left-hand side of the inequality in Equation (3) (and in Equation (2) as well) depends on T and T is random, the inequality in Equation (3) cannot be guaranteed for each observed T. We therefore guarantee Equation (3) with a large (close to 1) probability γ with respect to the randomness in T (Equation (4)), which in turn guarantees that
$$P_T\left\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha \right\} \ge \gamma, \tag{5}$$
where $P_T$ denotes probability with respect to the randomness in T. Computer code in R was provided by Liu et al. (2019) to compute the λ that solves Equation (4), which allows the confidence sets $C_T(y_j)$ in Equation (1) to be constructed for each future object.
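Once λ is available, building the confidence set of Equation (1) for a future object only requires one squared Mahalanobis distance per class. The following Python sketch illustrates this under stated assumptions: it is not the R code of Liu et al. (2019), the function names `solve_linear`, `mahalanobis_sq` and `confidence_set` are hypothetical, and the estimates and λ in the usage note are toy values chosen for illustration only.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def solve_linear(a: Matrix, b: Vector) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def mahalanobis_sq(y: Vector, mean: Vector, cov: Matrix) -> float:
    """Return (y - mean)^T cov^{-1} (y - mean)."""
    d = [yi - mi for yi, mi in zip(y, mean)]
    return sum(di * zi for di, zi in zip(d, solve_linear(cov, d)))


def confidence_set(y: Vector, means: Sequence[Vector],
                   covs: Sequence[Matrix], lam: float) -> List[int]:
    """Classes l (numbered from 1) whose squared distance to y is <= lam."""
    return [l for l, (m, s) in enumerate(zip(means, covs), start=1)
            if mahalanobis_sq(y, m, s) <= lam]
```

With two toy classes centered at (0, 0) and (3, 0), identity covariance estimates and λ = 4, the call `confidence_set([1.5, 0.0], means, covs, 4.0)` returns both classes, while a point at (0, 0) is assigned to class 1 only and a point far from both centers yields an empty set.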
The interpretation of Equation (5), and of Equation (6) below, is that, based on one observed training dataset T, one constructs confidence sets $C_T(y_j)$ for the $c_j$ of all future $y_j$ ($j = 1, 2, \ldots$) and claims that at least a proportion 1 − α of these confidence sets contain the true $c_j$. We are then γ confident, with respect to the randomness in the training dataset T, that the claim is correct.
A natural question is how to find the exact critical constant λ that solves the equation
$$P_T\left\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha \right\} = \gamma, \tag{6}$$
which is an improvement on the conservative λ that solves Equation (4). Next, we show how to find the exact critical constant λ under an additional assumption which is satisfied in some applications. Assume that, among the N future objects that need to be classified, $N_l$ objects are actually from the $l$th class with the distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The additional assumption we make is that
$$\lim_{N \to \infty} \frac{N_l}{N} = r_l, \quad l = 1, \ldots, k, \tag{7}$$
where the $r_l$ are assumed to be known constants in the interval [0, 1]. Intuitively, this assumption means that we know the proportions of the future objects that belong to each of the k classes, even though we do not know the true class of each individual future object.
The assumption in Equation (7) is reasonable in some applications. For example, when screening for a particular disease among a specific population for preventive purposes, there are k = 2 classes: having the disease (l = 1) or not having the disease (l = 2). If we know the prevalence of the disease, d, in the overall population, then $r_1 = d$ and $r_2 = 1 - d$, even though we do not know whether an individual subject has the disease or not.
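To make the role of the assumption concrete in the screening example, the following worked expression may help. It is a sketch that uses shorthand not in the original: $\pi_l(\lambda)$ stands for the conditional probability, given the training dataset T, that the confidence set of a future object from class l contains l. With prevalence d, the long-run proportion of future confidence sets that contain the true class is the prevalence-weighted average
$$d\,\pi_1(\lambda) + (1 - d)\,\pi_2(\lambda), \qquad \pi_l(\lambda) = P_{y \mid T}\left\{ (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda \right\} \ \text{for } y \sim N(\mu_l, \Sigma_l).$$
For instance, with a hypothetical prevalence d = 0.1, this proportion is $0.1\,\pi_1(\lambda) + 0.9\,\pi_2(\lambda)$, so the coverage for the more common class dominates the overall proportion.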
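It is shown in Appendix A that, under the assumption in Equation (7), Equation (6) is equivalent to
$$P_{u_l, \{v_{lm}\}}\left\{ \sum_{l=1}^{k} r_l\, P_{w_l \mid u_l, \{v_{lm}\}}\left\{ (w_l - u_l)^{\top} \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^{\top} \right)^{-1} (w_l - u_l) \le \lambda \right\} \ge 1 - \alpha \right\} = \gamma, \tag{8}$$
where
$$w_l \sim N(0, I_p), \quad u_l \sim N(0, I_p / n_l), \quad v_{lm} \sim N(0, I_p), \quad m = 1, \ldots, n_l - 1; \ l = 1, \ldots, k, \tag{9}$$
and all the $w_l$, $u_l$ and $v_{lm}$ are independent, $P_{w_l \mid u_l, \{v_{lm}\}}\{\cdot\}$ denotes the conditional probability about $w_l$ conditioning on $(u_l, \{v_{lm}\})$, and $P_{u_l, \{v_{lm}\}}\{\cdot\}$ denotes the probability about $(u_l, \{v_{lm}\})$.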

Algorithm for Computing the Exact λ
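We now consider how to compute the critical constant λ that solves Equation (8). Similar to Liu et al. (2019), this is accomplished by simulation in the following way. From the distributions given in Equation (9), in the $s$th repeat of simulation, $s = 1, \ldots, S$, generate independent $u_l^s \sim N(0, I_p / n_l)$ and $v_{l1}^s, \ldots, v_{l(n_l - 1)}^s \sim N(0, I_p)$, $l = 1, \ldots, k$, and find the $\lambda_s > 0$ that solves
$$\sum_{l=1}^{k} r_l\, P_{w_l}\left\{ (w_l - u_l^s)^{\top} \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm}^s (v_{lm}^s)^{\top} \right)^{-1} (w_l - u_l^s) \le \lambda_s \right\} = 1 - \alpha. \tag{10}$$
Because the left-hand side of Equation (10) is increasing in $\lambda_s$, the event inside the outer probability in Equation (8) occurs if and only if $\lambda_s \le \lambda$, so the λ that solves Equation (8) is the γ quantile of the distribution of $\lambda_s$. It is approximated by the sample γ quantile of the simulated values $\lambda_1, \ldots, \lambda_S$.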
To find the $\lambda_s$ in Equation (10) for each s, we use simulation in the following way. Generate independent random vectors $\{w_{lq} : q = 1, \ldots, Q;\ l = 1, \ldots, k\}$ from $N(0, I_p)$, where Q is the number of simulations for finding $\lambda_s$. For each given value of $\lambda_s > 0$, the expression on the left-hand side of Equation (10) can be computed by approximating each of the k probabilities involved by the corresponding proportion out of the Q simulations. It is also clear that this expression is monotone increasing in $\lambda_s$. Hence, the $\lambda_s$ that solves Equation (10) can be found by a search algorithm; for example, the bisection method is used in our R code. To approximate the probabilities reasonably accurately by the proportions, a large Q value, e.g., 10,000, should be used.
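The two nested simulations described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' R program: the function names (`cholesky`, `quad_form`, `simulate_lambda_s`, `exact_lambda`) are hypothetical, fresh $w_{lq}$ are drawn in every repeat s, the bisection tolerance is an arbitrary choice, and the γ quantile is taken as the ⌈γS⌉th smallest simulated value. It requires $n_l - 1 \ge p$ so that the simulated matrices are invertible.

```python
import bisect
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def quad_form(low, d):
    """d^T a^{-1} d, computed as |z|^2 where low z = d (forward substitution)."""
    z = []
    for i in range(len(d)):
        z.append((d[i] - sum(low[i][t] * z[t] for t in range(i))) / low[i][i])
    return sum(zi * zi for zi in z)


def simulate_lambda_s(rng, p, n, r, alpha, Q, tol=1e-6):
    """One repeat s: draw (u_l, v_lm), then solve the weighted-coverage equation."""
    sorted_quads = []
    for n_l in n:
        u = [rng.gauss(0.0, 1.0) / math.sqrt(n_l) for _ in range(p)]
        vs = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_l - 1)]
        a = [[sum(v[i] * v[j] for v in vs) / (n_l - 1) for j in range(p)]
             for i in range(p)]
        low = cholesky(a)
        # Q draws of w_l ~ N(0, I_p); store the quadratic forms in sorted order.
        sorted_quads.append(sorted(
            quad_form(low, [rng.gauss(0.0, 1.0) - ui for ui in u])
            for _ in range(Q)))

    def coverage(lam):
        # Weighted sum of the k simulated proportions with quadratic form <= lam.
        return sum(r_l * bisect.bisect_right(qs, lam) / Q
                   for r_l, qs in zip(r, sorted_quads))

    lo, hi = 0.0, max(qs[-1] for qs in sorted_quads)
    while hi - lo > tol:  # bisection; coverage is non-decreasing in lam
        mid = 0.5 * (lo + hi)
        if coverage(mid) >= 1.0 - alpha:
            hi = mid
        else:
            lo = mid
    return hi


def exact_lambda(gamma, alpha, p, n, r, S, Q, seed=0):
    """Sample gamma quantile of the S simulated lambda_s values."""
    rng = random.Random(seed)
    lams = sorted(simulate_lambda_s(rng, p, n, r, alpha, Q) for _ in range(S))
    return lams[min(S - 1, max(0, math.ceil(gamma * S) - 1))]
```

A call such as `exact_lambda(0.95, 0.05, 2, [50, 50, 50], [0.3, 0.4, 0.3], S=200, Q=2000)` mirrors the structure of the computation with much smaller S and Q than recommended in the text, so its result should be read as a rough check of the procedure rather than as a reproduction of the reported values.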
It is noteworthy from Equations (8) and (9) that λ depends only on γ, α, p, k, $n_1, \ldots, n_k$ and $r_1, \ldots, r_k$ (and on the numbers of simulations S and Q, which determine the numerical accuracy of λ due to simulation randomness). It is also worth emphasizing that only one λ needs to be computed based on the observed training dataset T, which is then used for constructing the confidence sets $C_T(y_j)$ and classifying accordingly all future objects.
It is expected that larger values of S and Q will produce a more accurate λ value. One can use the method discussed by Liu et al. (2019) to assess how the accuracy of λ depends on the values of S and Q. As in the work by Liu et al. (2019), it is recommended to set S = 10,000 and Q = 10,000 to balance computation time against the accuracy of λ under simulation randomness.
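As a rough guide to the size of the simulation error (a back-of-envelope binomial calculation, not the assessment method of Liu et al. (2019)), a probability near 0.95 that is estimated by a proportion out of 10,000 independent draws has a standard error of about
$$\sqrt{\frac{0.95 \times 0.05}{10{,}000}} \approx 0.0022.$$
The same figure applies to the inner proportions based on Q = 10,000 draws and to the empirical distribution function of the $\lambda_s$ at its γ = 0.95 quantile based on S = 10,000 repeats, which gives some intuition for why values of this order are recommended.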

An Illustrative Example
As in the work of Liu et al. (2019), the famous iris dataset introduced by Fisher (1936) [8] is used in this section to illustrate the method proposed in this paper. The dataset contains k = 3 classes representing the three species of Iris flowers (1 = setosa; 2 = versicolor; and 3 = virginica), and has $n_l = 50$ observations from each class in T. Each observation gives the measurements (in centimeters) of four variables: sepal length and width, and petal length and width.
We focus on the case in which only the first two measurements, sepal length and width, are used for classification, since the acceptance sets $A_l = \{ y \in \mathbb{R}^p : (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda \}$, $l = 1, 2, 3$, are then two-dimensional and can be easily plotted. Based on the fifty observations on p = 2 measurements from each of the three classes, the $\hat{\mu}_l$ and $\hat{\Sigma}_l$ were given by Liu et al. (2019).
For α = 5% and γ = 95%, the critical constant λ that solves Equation (4) was computed by Liu et al. (2019) to be $\lambda_{\mathrm{con}} = 9.175$ using S = 10,000 and Q = 10,000. The corresponding acceptance sets, from which the confidence set $C_T(y)$ in Equation (1) can be constructed directly (cf. [6]), are given by $A_l^{\mathrm{con}} = \{ y \in \mathbb{R}^p : (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda_{\mathrm{con}} \}$, $l = 1, 2, 3$, and are plotted in Figure 1 as the dotted ellipsoidal regions centered at $\hat{\mu}_l$, marked by "+".
Now, assume that we know the proportions $(r_1, r_2, r_3)$ of the three species among all the Iris flowers, and that the Iris flowers that need to be classified reflect this composition. For the same α = 5%, γ = 95%, S = 10,000 and Q = 10,000, and with, for example, $(r_1, r_2, r_3) = (0.3, 0.4, 0.3)$, the exact critical constant λ that solves Equation (6) is computed by our R program to be $\lambda_{\mathrm{exa}} = 7.737$. As expected, $\lambda_{\mathrm{exa}}$ is smaller than $\lambda_{\mathrm{con}}$ and, as a result, the corresponding confidence set $C_T(y)$ in Equation (1) with $\lambda = \lambda_{\mathrm{exa}}$ and the acceptance sets $A_l^{\mathrm{exa}} = \{ y \in \mathbb{R}^p : (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda_{\mathrm{exa}} \}$, $l = 1, 2, 3$, are also smaller than those given by Liu et al. (2019). The acceptance sets $A_l^{\mathrm{exa}}$, $l = 1, 2, 3$, are plotted in Figure 1 as the solid ellipsoidal regions.
For example, if a future object has y = (4.79, 2.35), marked by a solid dot in Figure 1, then the conservative confidence set of Liu et al. (2019) classifies the object as from Classes 2 and 3, since this y belongs to both $A_2^{\mathrm{con}}$ and $A_3^{\mathrm{con}}$. However, the new exact confidence set of this paper classifies the object as from Class 2 only, since this y belongs to $A_2^{\mathrm{exa}}$ but not to $A_1^{\mathrm{exa}}$ or $A_3^{\mathrm{exa}}$. This demonstrates the advantage of the new confidence set using $\lambda_{\mathrm{exa}}$ in this paper over the conservative confidence set using $\lambda_{\mathrm{con}}$ by Liu et al. (2019). We have also computed the value of $\lambda_{\mathrm{exa}}$ for several other given $(r_1, r_2, r_3)$.
For example, $\lambda_{\mathrm{exa}} = 7.706$ for $(r_1, r_2, r_3) = (1/3, 1/3, 1/3)$, $\lambda_{\mathrm{exa}} = 7.865$ for $(r_1, r_2, r_3) = (0.1, 0.45, 0.45)$, and $\lambda_{\mathrm{exa}} = 8.019$ for $(r_1, r_2, r_3) = (0.1, 0.7, 0.2)$. The conservative $\lambda_{\mathrm{con}} = 9.175$ is considerably larger than these $\lambda_{\mathrm{exa}}$ values, by between 14% and 19%.
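The quoted range can be checked directly from the reported constants. The short Python snippet below does this arithmetic; the helper name `percent_excess` is hypothetical, and the only inputs are the values stated in the text.

```python
# Critical constants quoted in the text (alpha = 5%, gamma = 95%, p = 2).
lam_con = 9.175
lam_exa = {
    (0.3, 0.4, 0.3): 7.737,
    (1 / 3, 1 / 3, 1 / 3): 7.706,
    (0.1, 0.45, 0.45): 7.865,
    (0.1, 0.7, 0.2): 8.019,
}


def percent_excess(conservative, exact):
    """How much larger, in percent, the conservative constant is."""
    return 100.0 * (conservative / exact - 1.0)


excess = {r: percent_excess(lam_con, lam) for r, lam in lam_exa.items()}
for r, pct in excess.items():
    print(r, round(pct, 1))
```

Since the region $\{y : (y - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda\}$ in two dimensions has area $\pi \lambda \sqrt{\det \hat{\Sigma}_l}$, which is proportional to λ, the same percentages describe how much larger the conservative acceptance ellipses are in area when p = 2.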
One can download from http://www.personal.soton.ac.uk/wl/Classification/ the R computer program ExactConfidenceSetClassifier.R that implements this simulation method of computing the critical constant $\lambda_{\mathrm{exa}}$. The computation of one $\lambda_{\mathrm{exa}}$ using (S, Q) = (10,000, 10,000) takes about 13 h on an ordinary Windows PC (Core(TM)2 Duo CPU P8400 @ 2.26 GHz).
However, it must be emphasized that the new confidence set is valid only if the assumption in Equation (7) holds.

Conclusions
The probability statement in Equation (5) gives the confidence sets of Liu et al. (2019) the nice interpretation that, with confidence level γ with respect to the randomness in the training dataset T, at least a proportion 1 − α of the confidence sets $C_T(y_j)$, $j = 1, 2, \ldots$, contain the true classes $c_j$ of the future objects $y_j$. However, the confidence set given by Liu et al. (2019) is conservative in that the λ in the confidence set in Equation (1) is computed to solve Equation (4), which implies the constraint in Equation (5). This paper considers how to compute the λ in the confidence set in Equation (1) so that the probability in Equation (5) is equal to γ, i.e., from Equation (6). The confidence sets using the λ that solves Equation (6) have confidence level equal to γ and so are exact. We show that this can be accomplished under the extra assumption given in Equation (7), which may be sensible in some applications.
As the $\lambda_{\mathrm{exa}}$ that solves Equation (6) is smaller than the $\lambda_{\mathrm{con}}$ that solves Equation (4) used by Liu et al. (2019), the new confidence sets are smaller, and so better, than the confidence sets given by Liu et al. (2019).
One wonders whether there are other sensible assumptions that allow the λ to be solved from Equation (6). This warrants further research.
If $C_T(y)$ for a future object y is empty then, since y must be from one of the k classes, $C_T(y)$ can be augmented to include the class that has the largest posterior probability using the naive Bayesian classifier, as in the work by Liu et al. (2019). Since this augmentation is applied only when $C_T(y)$ is empty and can only enlarge the confidence set, the probability statement in Equation (5) clearly continues to hold.
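The augmentation step itself is simple once posterior probabilities are available. The sketch below is an illustration under stated assumptions: the function name `augment_if_empty` is hypothetical, and the posterior probabilities are taken as given inputs, since how the naive Bayesian classifier of Liu et al. (2019) computes them is not described here.

```python
from typing import List, Sequence


def augment_if_empty(conf_set: List[int], posterior: Sequence[float]) -> List[int]:
    """Return conf_set unchanged unless it is empty; in that case return the
    single class (numbered from 1) with the largest posterior probability."""
    if conf_set:
        return list(conf_set)
    best = max(range(len(posterior)), key=lambda i: posterior[i])
    return [best + 1]
```

For example, `augment_if_empty([], [0.2, 0.5, 0.3])` returns `[2]`, whereas a non-empty set such as `[1, 3]` is returned as it is, so coverage can only increase.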
There are applications in which information about the proportions $r_l$ would be known only with uncertainty. For example, the training set may be a representative sample from the population and as such the proportion of each class can be estimated, or the proportions might have been estimated from a previous independent auxiliary dataset. If one replaces the $r_l$ in Equation (8) by these estimates, then the λ solved from Equation (8) will depend on these estimates and so be random. As a result, the probability statement in Equation (5)

Acknowledgments:
We would like to thank the referees for critical and constructive comments on the earlier version of the paper.

Conflicts of Interest:
The authors declare no conflict of interest.
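
Appendix A
Among the N future objects that need to be classified, let $N_l$ be the number of objects actually from the $l$th class, with the feature measurements denoted as $y_{l1}, \ldots, y_{lN_l}$, $l = 1, \ldots, k$. Clearly, we have $N_1 + \cdots + N_k = N$ and
$$\frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} = \sum_{l=1}^{k} \frac{N_l}{N} \cdot \frac{1}{N_l} \sum_{i=1}^{N_l} I_{\{c_l \in C_T(y_{li})\}}.$$
We have from the classical strong law of large numbers (cf. [16]) that, almost surely,
$$\lim_{N_l \to \infty} \frac{1}{N_l} \sum_{i=1}^{N_l} I_{\{c_l \in C_T(y_{li})\}} = E_{y_{li} \mid T}\, I_{\{c_l \in C_T(y_{li})\}},$$
in which the conditional expectation $E_{y_{li} \mid T}$ is used since all the confidence sets $C_T(y_{li})$ ($i = 1, \ldots, N_l$) use the same training dataset T. By noting that $y_{li}$, $i = 1, \ldots, N_l$, are from the $l$th class and thus have the same distribution $N(\mu_l, \Sigma_l)$, we have from the definition of $C_T(y)$ in Equation (1) that
$$E_{y_{li} \mid T}\, I_{\{c_l \in C_T(y_{li})\}} = P_{y_{l1} \mid T}\{c_l \in C_T(y_{l1})\} = P_{y_{l1} \mid T}\left\{ (y_{l1} - \hat{\mu}_l)^{\top} \hat{\Sigma}_l^{-1} (y_{l1} - \hat{\mu}_l) \le \lambda \right\} = P_{w_l \mid u_l, \{v_{lm}\}}\left\{ (w_l - u_l)^{\top} \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^{\top} \right)^{-1} (w_l - u_l) \le \lambda \right\},$$
where
$$w_l = \Sigma_l^{-1/2} (y_{l1} - \mu_l) \sim N(0, I_p), \quad u_l = \Sigma_l^{-1/2} (\hat{\mu}_l - \mu_l) \sim N(0, I_p / n_l), \quad v_{lm} = \Sigma_l^{-1/2} z_{lm} \sim N(0, I_p), \ m = 1, \ldots, n_l - 1,$$
with all the $w_l$, $u_l$ and $v_{lm}$ being independent. Here, by the standard representation of the sample covariance matrix, the $z_{lm}$ are i.i.d. $N(0, \Sigma_l)$ random vectors, independent of $\hat{\mu}_l$, with $(n_l - 1) \hat{\Sigma}_l = \sum_{m=1}^{n_l - 1} z_{lm} z_{lm}^{\top}$. Note that $w_l$ depends on the future observation $y_{l1}$ but not on the training dataset T, while $u_l$ and $\{v_{lm}\}$ depend on the training dataset T but not on the future observations.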
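Combining the assumption in Equation (7) with the results above gives, almost surely,
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} = \sum_{l=1}^{k} r_l\, P_{w_l \mid u_l, \{v_{lm}\}}\left\{ (w_l - u_l)^{\top} \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^{\top} \right)^{-1} (w_l - u_l) \le \lambda \right\},$$
from which the equivalence of Equations (6) and (8) follows.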