Confidence Sets for Statistical Classification

Abstract: Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences, among others. In statistical terms, classification is inference about the unknown parameters, i.e., the true classes of future objects. Hence, various standard statistical approaches can be used, such as point estimators, confidence sets and decision-theoretic approaches. For example, a classifier that classifies a future object as belonging to only one of several known classes is a point estimator. The purpose of this paper is to propose a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. An example is provided to illustrate the method, and a simulation study is included to highlight the desirable feature of the method.


Introduction
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences, among others. See, e.g., the recent books by [1][2][3][4]. Classical examples include medical diagnosis, automatic character recognition, data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis) and artificial intelligence (such as the development of machines with brain-like performance). As many important developments in this area are not confined to the statistics literature, various other names, such as supervised learning, pattern recognition and machine learning, have been used. In recent years, there have been many exciting new developments in both methodology and applications, taking advantage of increased computational power readily available nowadays. Broadly speaking, classification methods can be divided into probabilistic methods (including Bayesian classifiers), regression methods (including logistic regression and regression trees), geometric methods (including support vector machines), and ensemble methods (combining classifiers for improved robustness).
A classifier is a decision rule built from a training data set T that classifies all future objects as belonging to one or several of the k known classes, where k is a pre-specified number. The drawback of a classifier that classifies each future object into only one of k classes is that, when the object is close to the classification boundaries of several classes, M(≥ 2) say, the chance of misclassification is close to (M − 1)/M, which may be close to one when M is large. A sensible approach in this situation is to acknowledge that such an object has similar chances of belonging to M classes and hence to avoid classifying it into only one of the M classes. In medical diagnosis, for example, if there is not enough evidence to classify a patient as having a disease or not, then it is wise not to give a diagnosis that is quite likely to be wrong.
Various procedures have been proposed in the literature to deal with this difficulty. One type of procedure allows a rejection option, that is, if a future object falls into a 'rejection' region, then no classification is made for the object. Such a procedure aims to construct a suitable rejection region to minimize a pre-specified risk; see, e.g., [5][6][7][8][9] and the references therein. Non-deterministic classifiers are proposed in [10], which allow a future object to be classified possibly into several classes. Again, such a classifier is constructed to minimize a pre-specified risk.
For the binary classification problem (i.e., k = 2), ref. [11] proposes to find two 'tolerance' regions (corresponding to the two classes) in the feature/predictor space, each with a specified coverage level for its class, that minimize the probability that an object falls into the intersection of the two tolerance regions, since an object in this intersection will not be classified. This approach is akin to the decision-theoretic approaches mentioned in the last paragraph but uses this specific probability as the risk to minimize. As with other decision-theoretic approaches, it is not constructed to guarantee the proportion of correct classification and thus is different from the approach proposed in this paper. Further development of this approach is considered in [12].
The conformal prediction approach of [13,14] also classifies a future object into possibly several classes that contain the true class with a pre-specified probability. However, this approach is designed for the 'online' setting in which the true classes of all the observed objects are revealed and hence known before the classification of the next object is made. This online setting is different from the usual setting of classification considered in this paper, in which a classifier is built from the available training data set T and then used to classify a large number of future objects without knowing their true classes.
For the binary classification problem, ref. [15] proposes a classifier that allows no classification of an object. By controlling the size of the non-classification region (for which classification error does not occur) via a tuning constant, a 'generalized error' of the classifier is controlled at a pre-specified level with a specified confidence about the randomness in the training data set T. The construction of this classifier is related to the tolerance sets going back to [16]. Note, however, that the algorithm in [15] may result in a different classifier if a different observation in the training data set is used as the 'base' instance in the algorithm, which is quite odd from a statistical point of view. In addition, the 'generalized error' is different from the long-run frequency of correct classification, which the procedure proposed in this paper aims to control.
The purpose of this paper is to propose a classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. Specifically, classification of a future object is treated as a standard problem of statistical inference about the unknown parameter c, the true class of the object, and the confidence set approach for c is adopted. In order to consider the probability of correct classification, it is necessary to assume certain probability distributions for the feature measurements from the k classes. In this paper, the feature measurements of the k classes are assumed to follow multivariate normal distributions, an assumption that is widely used either directly or after some transformation (see [17,18]).
The layout of the paper is as follows: Section 2 contains some preliminaries, including the idea of [19,20] from which the new approach proposed in this paper is developed. The simple situation where the means $\mu_i$ and covariance matrices $\Sigma_i$ of the k multivariate normal distributions underlying the k classes are assumed to be known is considered in Section 3. The more realistic situation where both $\mu_i$ and $\Sigma_i$ are unknown parameters is studied in Section 4. Section 5 provides an illustrative example. A simulation study is given in Section 6 to highlight the major advantage of the new classifier proposed in this paper. Section 7 provides the conclusions. Finally, some mathematical details are provided in Appendix A.

Preliminaries
Let the p-dimensional data vector $x_i = (x_{i1}, \ldots, x_{ip})^T$ denote the feature measurement on an object from the ith class, which has multivariate normal distribution $N(\mu_i, \Sigma_i)$, $i = 1, \ldots, k$. The training data set is given by $T = \{x_{i1}, \ldots, x_{in_i};\ i = 1, \ldots, k\}$, where $x_{i1}, \ldots, x_{in_i}$ are i.i.d. observations from the ith class with distribution $N(\mu_i, \Sigma_i)$, $i = 1, \ldots, k$. The classification problem is to make inference about c, the true class of a future object, based on the feature measurement $y = (y_1, \ldots, y_p)^T$ observed on the object, which is only known to belong to one of the k classes and so follows one of the k multivariate normal distributions. In statistical terminology, c is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \ldots, k\}$. We emphasize that c is treated as non-random in our frequentist approach.
A classifier that classifies an object with measurement y into one single class in C = {1, . . . , k} can be regarded as a point estimator of c. The classifier proposed in this paper provides a set C T (y) ⊆ C as plausible values of c. Depending on y and the training data set T , C T (y) may contain only a single value, in which case y is classified into one single class given by C T (y). When C T (y) contains more than one value in C, y is classified as possibly belonging to the several classes given by C T (y). Hence, in statistical terms, the classifier proposed in this paper uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed 1 − α proportion of confidence sets that contain the true classes.
The confidence set for c is constructed below by inverting a family of acceptance sets for testing $H_0: c = l$ for each $l \in C$. This method of constructing a confidence set was given by [21] and has been used and generalized to construct numerous intriguing confidence sets; see, e.g., [22][23][24][25][26][27][28][29].
Now, the key idea of [19,20], which is crucial for understanding our proposed approach to classification, is presented very briefly. Assuming that the response y and the predictor x are related by a standard linear regression model $y = \alpha_0 + \alpha_1 x + \varepsilon$, where $\varepsilon$ is the random error, and that a training data set T on (y, x) is available for estimating $\alpha_0$, $\alpha_1$ and the error variance $\sigma^2$, refs. [19,20] consider how to construct confidence sets for the unknown (non-random) values of the predictor x corresponding to the large number of future observed values of the response y. As the same training data set T is used in the construction of all these confidence sets, the randomness in the future y-values and the randomness in T clearly play different roles and thus should be treated differently. The procedure proposed in [19,20] has a probability of at least γ, with respect to the randomness in T, that at least 1 − α proportion of all the confidence sets, constructed from the same T, include the true x-values, where γ and 1 − α are pre-specified probabilities. This idea/approach has been studied by many researchers; see, e.g., [30][31][32][33][34][35][36] and the references therein. One fundamental result is that the confidence sets constructed from (γ, 1 − α) simultaneous tolerance intervals do satisfy the 'γ-probability-(1 − α)-proportion' property specified above. In particular, ref. [36] points out, by constructing a counterexample, that the confidence sets constructed from (γ, 1 − α) pointwise tolerance intervals do not guarantee the 'γ-probability-(1 − α)-proportion' property in general.
A similar idea is also used in [37] to construct confidence sets for the numbers of coins in all future bags with known weights.
Since a classifier is built from the training data set T and then used to classify a large number of future objects in terms of confidence sets for their true classes, the future observed feature vectors y play a role similar to that of the future observed response values, whilst the unknown true classes of the future objects play a role similar to that of the unknown true x-values corresponding to the future observed response values, in the approach of [19,20] given in the last paragraph. Hence, it is natural to adopt the approach of [19,20] to construct confidence sets for the unknown true classes of future objects with the 'γ-probability-(1 − α)-proportion' property, that is, the probability, with respect to the randomness in T, is at least γ that at least 1 − α proportion of all the confidence sets constructed from the same T do include the unknown true classes of the future objects.

Known µ i and Σ i
In this section, the values of µ i and Σ i are assumed to be known, which helps to motivate and understand the confidence sets constructed in Section 4 for the more realistic situation where the values of µ i and Σ i are unknown. Since µ i and Σ i are known, no training data set T is required to estimate µ i and Σ i . Hence, the confidence sets in this section are denoted as C(y), without the subscript T .
If y is from the lth class, then $y \sim N(\mu_l, \Sigma_l)$ and so $(y - \mu_l)^T \Sigma_l^{-1} (y - \mu_l)$ has the chi-square distribution $\chi^2_p$ with p degrees of freedom. We construct a 1 − α confidence set for the class c of the observed y by using Neyman's method [21] of inverting a family of 1 − α acceptance sets for testing $H_0: c = l$ for each number l in C. Specifically, the acceptance set for $H_0: c = l$ is given by
$$A_l = \left\{ y \in \mathbb{R}^p : (y - \mu_l)^T \Sigma_l^{-1} (y - \mu_l) \le \lambda \right\}, \tag{1}$$
where $\lambda = \chi^2_{p,1-\alpha}$ is the 1 − α quantile of the $\chi^2_p$ distribution. It follows directly from Neyman's method that the confidence set is given by
$$C(y) = \left\{ l \in C : y \in A_l \right\}. \tag{2}$$
It is straightforward to show, by using the Neyman-Pearson lemma, that the acceptance set $A_l$ in Equation (1) is optimal in terms of having the smallest volume among all the 1 − α acceptance sets for testing $H_0: c = l$.
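The construction of the acceptance sets and of C(y) can be sketched in a few lines. The following is a minimal illustration only, restricted to p = 2 so that the chi-square quantile has a closed form; the class means and covariance matrices are hypothetical values chosen for the sketch and are not those of Figure 1, and the function names are not taken from the authors' R program.

```python
import math

def chi2_quantile_2df(prob):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    For p = 2 the CDF is 1 - exp(-x / 2), so the quantile is closed-form.
    """
    return -2.0 * math.log(1.0 - prob)

def mahalanobis_sq(y, mu, sigma):
    """(y - mu)^T sigma^{-1} (y - mu) for a symmetric 2x2 matrix sigma."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * d - b * b
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def confidence_set(y, classes, lam):
    """Equation (2): all labels l whose acceptance set A_l contains y."""
    return {l for l, (mu, sigma) in classes.items()
            if mahalanobis_sq(y, mu, sigma) <= lam}

# Hypothetical class parameters, chosen only for illustration.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
classes = {1: ([0.0, 0.0], IDENTITY),
           2: ([3.0, 0.0], IDENTITY),
           3: ([3.5, 0.5], IDENTITY)}

lam = chi2_quantile_2df(0.95)                      # about 5.991 for alpha = 5%
set_single = confidence_set([0.0, 0.0], classes, lam)
set_double = confidence_set([3.2, 0.2], classes, lam)
set_empty = confidence_set([10.0, 10.0], classes, lam)
print(round(lam, 3), set_single, set_double, set_empty)
```

With these illustrative parameters, a point at the centre of class 1 receives a single label, a point between the centres of classes 2 and 3 receives both labels, and a point far from all three centres receives the empty set.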
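As for the usual confidence sets, it is desirable that, among the confidence sets $C(y_1), C(y_2), \ldots$ for the corresponding unknown true classes $c_1, c_2, \ldots \in C$ of the infinitely many future $y_j$ with distribution $N(\mu_{c_j}, \Sigma_{c_j})$ (j = 1, 2, . . .), at least 1 − α proportion contain the true classes, that is,
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C(y_j)\}} \ge 1 - \alpha, \tag{3}$$
where $I_A$ denotes the indicator function of set A and so $\frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C(y_j)\}}$ is the proportion among the N confidence sets $C(y_j)$ that contain the true classes $c_j$. It is shown in Appendix A that the property in Equation (3) holds with equality.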
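The long-run proportion in Equation (3) can be checked empirically. The sketch below is an assumption-laden illustration rather than the authors' code: it uses p = 2, hypothetical class parameters, an arbitrary deterministic sequence of true classes, and a fixed random seed. The true class c is in C(y) exactly when y lies in the acceptance set of class c, so only that one check is needed per object.

```python
import math
import random

def mahalanobis_sq(y, mu, sigma):
    """(y - mu)^T sigma^{-1} (y - mu) for a symmetric 2x2 matrix sigma."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * d - b * b
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def draw_normal_2d(rng, mu, sigma):
    """Draw from N(mu, sigma) via the Cholesky factor of a 2x2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2]

def coverage_proportion(classes, lam, n_future, seed=0):
    """Proportion of future objects whose true class lies in C(y)."""
    rng = random.Random(seed)
    labels = sorted(classes)
    hits = 0
    for j in range(n_future):
        c = labels[j % len(labels)]        # any sequence of true classes would do
        mu, sigma = classes[c]
        y = draw_normal_2d(rng, mu, sigma)
        if mahalanobis_sq(y, mu, sigma) <= lam:   # c in C(y) iff y in A_c
            hits += 1
    return hits / n_future

# Hypothetical class parameters, chosen only for illustration.
classes = {1: ([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]]),
           2: ([1.0, 1.0], [[0.4, -0.1], [-0.1, 0.9]]),
           3: ([2.0, 0.5], [[2.0, 0.0], [0.0, 0.2]])}
lam = -2.0 * math.log(0.05)                # chi-square 0.95 quantile for p = 2
prop = coverage_proportion(classes, lam, n_future=100_000)
print(prop)
```

The printed proportion should be close to 1 − α = 0.95 whatever sequence of true classes is used, which is the point of the property.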
The interpretation of the property in Equation (3) is similar to that of a standard confidence set. The noteworthy difference is that the confidence sets $C(y_j)$ are for possibly different parameters $c_j$ (j = 1, 2, . . .). In addition, note that, for each j, $C(y_j)$ is a standard 1 − α level confidence set for $c_j$, with $y_j$ being the only source of randomness. Figure 1 gives an illustrative example with k = 3, p = 2, the values of $\mu_l$ and $\Sigma_l$ taken to be the estimates computed from the iris data in Section 5, and α = 5% (and so $\lambda = \chi^2_{2,0.95} = 5.991$). Specifically, the acceptance set $A_l$ in Equation (1) is represented in Figure 1 by the ellipsoidal region centred at $\mu_l$, marked by '+', l = 1, 2, 3. If $y \in A_l$ then l is an element of the confidence set C(y) given in Equation (2). Hence, the following four situations can occur. (a) y falls into only one $A_l$ and so C(y) has a single class. For example, if $y \in A_1 \cap A_2^c \cap A_3^c$, then C(y) = {1}, i.e., y is classified as belonging to class 1. (b) y falls into two $A_l$'s but not the other one, and so C(y) contains two classes. For example, if $y \in A_1 \cap A_2 \cap A_3^c$, then C(y) = {1, 2}, i.e., y is classified as belonging to possibly class 1 or class 2. (c) y falls into all three $A_l$'s, i.e., $y \in A_1 \cap A_2 \cap A_3$, and so y is classified as belonging to possibly all three classes. (d) y falls outside all the $A_l$'s, i.e., $y \in A_1^c \cap A_2^c \cap A_3^c$, and so C(y) = ∅ and y is classified as not belonging to any one of the three classes. There is nothing wrong with this last classification since this y is judged not to be from the class l by the acceptance set $A_l$ for each l, though such a y must be rare in order to guarantee the property in Equation (3). On the other hand, since it is known that y is from one of the k classes, it is sensible to classify such a y according to any reasonable classifier, e.g., a Bayesian classifier illustrated in the next paragraph. As the resultant confidence set $C^*(y)$ from this augmentation contains C(y), the property in Equation (3) clearly still holds for $C^*(y)$.
For example, if $y = (4.5, 2.0)^T$, which is marked by '*' in Figure 1, then C(y) = ∅. The Bayesian classifier and the augmented confidence set $C^*(y)$ can be worked out in the following way. Assume a non-informative prior π(1) = π(2) = π(3) = 1/3 about the class c of y; then the posterior probability of y belonging to class l is given by
$$\pi(l \mid y) = \frac{\pi(l)\, f(y, l)}{p(y)} = \frac{f(y, l)}{3\, p(y)}, \quad l = 1, 2, 3,$$
where f(y, l) is the probability density function of $N(\mu_l, \Sigma_l)$ and p(y) is the marginal density of y and so does not depend on l. Hence, the Bayesian classifier classifies y to the class $i_0$ that satisfies $f(y, i_0) = \max_{l \in C} f(y, l)$. For $y = (4.5, 2.0)^T$, we have f(y, 1) = 0.009, f(y, 2) = 0.612 and f(y, 3) = 0.046. Hence, the Bayesian classifier and the augmented confidence set $C^*(y)$ classify y to class 2.
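The augmentation step can be sketched as follows. This is an illustration under stated assumptions: p = 2, equal priors (so the class with the largest density is chosen), hypothetical class parameters rather than those of Figure 1, and log-densities instead of densities to avoid numerical underflow for points far from every class.

```python
import math

def mahalanobis_sq(y, mu, sigma):
    """(y - mu)^T sigma^{-1} (y - mu) for a symmetric 2x2 matrix sigma."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * d - b * b
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def log_density(y, mu, sigma):
    """Log of the bivariate normal density f(y, l)."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] ** 2
    return (-0.5 * mahalanobis_sq(y, mu, sigma)
            - 0.5 * math.log(det) - math.log(2.0 * math.pi))

def augmented_confidence_set(y, classes, lam):
    """C*(y): equal to C(y) when non-empty, else the equal-prior Bayes class."""
    plain = {l for l, (mu, sigma) in classes.items()
             if mahalanobis_sq(y, mu, sigma) <= lam}
    if plain:
        return plain
    # With equal priors the posterior is proportional to the density f(y, l).
    best = max(classes, key=lambda l: log_density(y, classes[l][0], classes[l][1]))
    return {best}

# Hypothetical class parameters, chosen only for illustration.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
classes = {1: ([0.0, 0.0], IDENTITY),
           2: ([3.0, 0.0], IDENTITY),
           3: ([3.5, 0.5], IDENTITY)}
lam = -2.0 * math.log(0.05)                # chi-square 0.95 quantile for p = 2

far_point = augmented_confidence_set([10.0, 10.0], classes, lam)
near_point = augmented_confidence_set([0.0, 0.0], classes, lam)
print(far_point, near_point)
```

Because C*(y) always contains C(y), the augmentation can only increase the proportion of sets that contain the true class.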
In this particular example, A 1 and A 3 do not intersect as seen from Figure 1 and so any future y will not be classified to be in both classes 1 and 3. This reflects the fact that the distributions of the classes 1 and 3 are quite different/separated and so easy to distinguish. On the other hand, the distributions of the classes 2 and 3 are similar and so hard to distinguish. As a result, A 2 and A 3 have a large overlap and hence many future y's will be classified as belonging to both classes 2 and 3.

Methodology
Now, we consider the more realistic situation where both the values of µ i and Σ i are unknown and so need to be estimated from the training data set T , independent of the future observations y j (j = 1, 2, . . .) whose classes c j are unknown and need to be inferred.
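The training data set $T = \{x_{i1}, \ldots, x_{in_i};\ i = 1, \ldots, k\}$ can be used to estimate $\mu_i$ and $\Sigma_i$ in the usual way:
$$\hat\mu_i = \bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \qquad \hat\Sigma_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T, \quad i = 1, \ldots, k.$$
Mimicking the confidence set in Equation (2), we construct the confidence set for the class c of y as
$$C_T(y) = \left\{ l \in C : (y - \hat\mu_l)^T \hat\Sigma_l^{-1} (y - \hat\mu_l) \le \lambda \right\}, \tag{4}$$
where λ is a suitably chosen critical constant whose determination is considered next. As in Section 3, it is desirable that the proportion of the future confidence sets $C_T(y_j)$ (j = 1, 2, . . .) that include the true classes $c_j$ (j = 1, 2, . . .) should be at least 1 − α:
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha. \tag{5}$$
It is shown in Appendix A that a sufficient condition for guaranteeing Inequality (5) is
$$\inf_{c_j \in C} E_{y_j \mid T}\, I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha, \tag{6}$$
where $E_{y_j \mid T}$ denotes the conditional expectation with respect to the random variable $y_j$ conditioning on the training data set T (or, equivalently, $\{(\hat\mu_1, \hat\Sigma_1), \ldots, (\hat\mu_k, \hat\Sigma_k)\}$).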
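The plug-in construction of the confidence set from training data can be sketched as below for p = 2. The tiny training set is made up for illustration, and the value passed as lam is only a placeholder: the critical constant depends on γ, α, p, k and the sample sizes and has to be computed by simulation, so the value 9.175 from the iris example is not valid for this toy data.

```python
def sample_mean_cov_2d(xs):
    """Sample mean and unbiased sample covariance of 2-dimensional observations."""
    n = len(xs)
    m0 = sum(x[0] for x in xs) / n
    m1 = sum(x[1] for x in xs) / n
    s00 = sum((x[0] - m0) ** 2 for x in xs) / (n - 1)
    s01 = sum((x[0] - m0) * (x[1] - m1) for x in xs) / (n - 1)
    s11 = sum((x[1] - m1) ** 2 for x in xs) / (n - 1)
    return [m0, m1], [[s00, s01], [s01, s11]]

def mahalanobis_sq(y, mu, sigma):
    """(y - mu)^T sigma^{-1} (y - mu) for a symmetric 2x2 matrix sigma."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * d - b * b
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def confidence_set_from_training(y, training, lam):
    """Labels l with (y - mu_hat_l)^T Sigma_hat_l^{-1} (y - mu_hat_l) <= lam."""
    estimates = {l: sample_mean_cov_2d(xs) for l, xs in training.items()}
    return {l for l, (mu, sigma) in estimates.items()
            if mahalanobis_sq(y, mu, sigma) <= lam}

# Made-up training data: two classes with four observations each.
training = {1: [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
            2: [[10.0, 10.0], [12.0, 10.0], [10.0, 12.0], [12.0, 12.0]]}
mu_hat, sigma_hat = sample_mean_cov_2d(training[1])
result = confidence_set_from_training([1.0, 1.0], training, lam=9.175)  # placeholder lam
print(mu_hat, sigma_hat, result)
```

The only difference from the known-parameter case is that the estimates replace the true parameters and that the critical constant is no longer the chi-square quantile.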
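Since the value of the expression on the left-hand side of Inequality (6) depends on T and T is random, Inequality (6) cannot be guaranteed for each observed T; a more detailed explanation of this is given in Appendix A. We therefore guarantee Inequality (6) with a large (close to 1) probability γ with respect to the randomness in T, which is shown in Appendix A to be equivalent to
$$P_T\left\{ \min_{1 \le l \le k} P_{w_l}\left\{ (w_l - u_l)^T \left[ \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^T \right]^{-1} (w_l - u_l) \le \lambda \right\} \ge 1 - \alpha \right\} = \gamma, \tag{7}$$
where
$$w_l \sim N(0, I_p), \quad u_l \sim N(0, I_p / n_l), \quad v_{lm} \sim N(0, I_p), \quad m = 1, \ldots, n_l - 1, \tag{8}$$
and all the $w_l$'s, $u_l$'s and $v_{lm}$'s are independent. Here $P_{w_l}$ denotes the probability with respect to $w_l$ for given $u_l$ and $v_{lm}$, and $P_T$ the probability with respect to the $u_l$'s and $v_{lm}$'s. This in turn guarantees that
$$P_T\left\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha \right\} \ge \gamma. \tag{9}$$
The interpretation of this statement is that, based on one observed training data set T, one constructs confidence sets $C_T(y_j)$ for the $c_j$'s of all future $y_j$ (j = 1, 2, . . .) and claims that at least 1 − α proportion of these confidence sets do contain the true $c_j$'s. Then, we are γ confident with respect to the randomness in the training data set T that the claim is correct.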
It is noteworthy that for the classification problem considered in this paper a classifier is built from one training data set T and then used to classify a large number of future y j 's. Hence, the randomness in both the training data set T and the future y j 's need to be accounted for but in different ways. This is reflected in our approach by the two numbers 1 − α and γ, analogous to the idea of [19,20] as pointed out in Section 2.
If we treat the two sources of randomness in y and T simultaneously on equal footing (instead of the approach given above), then it is straightforward to show that ([38], Section 5.2)
$$(y - \hat\mu_c)^T \hat\Sigma_c^{-1} (y - \hat\mu_c) \sim \frac{(n_c + 1)(n_c - 1)\, p}{n_c (n_c - p)}\, F_{p,\, n_c - p},$$
where c is the true class of y, and $F_{p, n_c - p}$ denotes an F random variable with degrees of freedom p and $n_c - p$. It follows therefore from Neyman's method that
$$C(T, y) = \left\{ l \in C : (y - \hat\mu_l)^T \hat\Sigma_l^{-1} (y - \hat\mu_l) \le \frac{(n_l + 1)(n_l - 1)\, p}{n_l (n_l - p)}\, f_{p,\, n_l - p,\, 1 - \alpha} \right\} \tag{10}$$
is a 1 − α confidence set for c, where $f_{p, n_l - p, 1 - \alpha}$ is the 1 − α quantile of $F_{p, n_l - p}$. However, this confidence set has the following coverage frequency interpretation. Collect one training data set T and the feature y of one future object, both of which are then used to compute the confidence set C(T, y) for the class c of y; then, the frequency of the confidence sets that contain the true c's is 1 − α among a large number of confidence sets constructed in this way. Note that, in this construction, one training data set T is used only once with one future y to produce one confidence set C(T, y), and so the randomness in one T and the randomness in one future y are treated on equal footing. This is clearly different from what is considered in this paper and from how statistical classification is used in most applications: only one training data set T is used to construct a classifier, which is then used repeatedly in the classification of a large number of future objects with observed y values. Hence, our proposed new method treats the two sources of randomness in T and future y's differently.
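For readers who want the intermediate step behind the F distribution used here, the following is a sketch based on the standard Hotelling $T^2$ result for multivariate normal samples; it assumes that the future y is independent of the training data and that $n_c > p$.

```latex
\begin{align*}
y - \hat\mu_c &\sim N\!\left(0,\ \tfrac{n_c + 1}{n_c}\,\Sigma_c\right),
\qquad (n_c - 1)\,\hat\Sigma_c \sim W_p(n_c - 1,\ \Sigma_c)
\quad \text{(independent of each other)},\\[4pt]
\frac{n_c}{n_c + 1}\,(y - \hat\mu_c)^T \hat\Sigma_c^{-1} (y - \hat\mu_c)
&\sim \frac{(n_c - 1)\,p}{n_c - p}\; F_{p,\,n_c - p},\\[4pt]
(y - \hat\mu_c)^T \hat\Sigma_c^{-1} (y - \hat\mu_c)
&\sim \frac{(n_c + 1)(n_c - 1)\,p}{n_c\,(n_c - p)}\; F_{p,\,n_c - p}.
\end{align*}
```

The first line uses $y \sim N(\mu_c, \Sigma_c)$ and $\hat\mu_c \sim N(\mu_c, \Sigma_c / n_c)$; the second is the Hotelling $T^2$ distribution with $n_c - 1$ degrees of freedom; the third rescales by $(n_c + 1)/n_c$.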

Algorithm for Computing λ
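We now consider how to compute the critical constant λ so that the probability $P_T$ in Equation (7) is equal to γ. This is accomplished by simulation in the following way. From the distributions given in Equation (8), generate independent $u_l^s$ and $v_{lm}^s$ ($m = 1, \ldots, n_l - 1$; $l = 1, \ldots, k$), and find the constant $\lambda_s$ that satisfies
$$\min_{1 \le l \le k} P_{w_l}\left\{ (w_l - u_l^s)^T \left[ \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm}^s (v_{lm}^s)^T \right]^{-1} (w_l - u_l^s) \le \lambda_s \right\} = 1 - \alpha. \tag{11}$$
Repeat this S times to get $\lambda_1, \ldots, \lambda_S$ and order these as $\lambda_{[1]} \le \ldots \le \lambda_{[S]}$. It is well known [39] that $\lambda_{[\gamma S]}$ converges to the required critical constant λ with probability one as S → ∞. Hence, $\lambda_{[\gamma S]}$ is used as the required critical constant λ for a large S value, 10,000 say.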
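To find the $\lambda_s$ in Equation (11) for each s, we also use simulation in the following way. Generate independent random vectors $\{w_{lq} : q = 1, \ldots, Q;\ l = 1, \ldots, k\}$ from $N(0, I_p)$, where Q is the number of simulations for finding $\lambda_s$. For each l, denote
$$t^s_{lq} = (w_{lq} - u_l^s)^T \left[ \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm}^s (v_{lm}^s)^T \right]^{-1} (w_{lq} - u_l^s), \quad q = 1, \ldots, Q,$$
and order these as $t^s_{l[1]} \le \ldots \le t^s_{l[Q]}$. Then $t^s_{l[(1-\alpha)Q]}$ converges to the population 1 − α quantile $\lambda_{sl}$ with probability one as Q → ∞, where $\lambda_{sl}$ satisfies
$$P_{w_l}\left\{ (w_l - u_l^s)^T \left[ \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm}^s (v_{lm}^s)^T \right]^{-1} (w_l - u_l^s) \le \lambda_{sl} \right\} = 1 - \alpha. \tag{12}$$
Hence, $\max_{1 \le l \le k} t^s_{l[(1-\alpha)Q]}$ converges to $\max_{1 \le l \le k} \lambda_{sl} = \lambda_s$ as Q → ∞ and is used as an approximation to $\lambda_s$ for a large Q value, 10,000 say.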
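The two nested simulations can be sketched in pure Python for p = 2. This is not the authors' ConfidenceSetClassifier.R program: the function names are hypothetical, fresh w draws are generated for every outer replication (an assumption about the scheme), and S and Q are kept very small so that the sketch runs quickly, which makes the resulting value only a rough approximation.

```python
import math
import random

def order_statistic(values, prob):
    """Return the ceil(prob * n)-th smallest value (1-based order statistic)."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(prob * len(ordered))))
    return ordered[rank - 1]

def simulate_lambda(n_sizes, alpha, gamma, S, Q, seed=0):
    """Approximate the critical constant for p = 2 by nested simulation."""
    rng = random.Random(seed)
    lambdas = []
    for _ in range(S):                      # outer loop: one simulated training set
        lam_s = 0.0
        for n in n_sizes:                   # one pass per class l
            # u_l ~ N(0, I_2 / n_l)
            u0 = rng.gauss(0.0, 1.0) / math.sqrt(n)
            u1 = rng.gauss(0.0, 1.0) / math.sqrt(n)
            # W_l = (1 / (n_l - 1)) * sum_m v_lm v_lm^T with v_lm ~ N(0, I_2)
            a = b = d = 0.0
            for _ in range(n - 1):
                v0, v1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
                a += v0 * v0
                b += v0 * v1
                d += v1 * v1
            a, b, d = a / (n - 1), b / (n - 1), d / (n - 1)
            det = a * d - b * b
            # inner loop: Q draws of (w - u)^T W^{-1} (w - u)
            t = []
            for _ in range(Q):
                e0 = rng.gauss(0.0, 1.0) - u0
                e1 = rng.gauss(0.0, 1.0) - u1
                t.append((d * e0 * e0 - 2.0 * b * e0 * e1 + a * e1 * e1) / det)
            # lambda_s is the largest of the per-class (1 - alpha) quantiles
            lam_s = max(lam_s, order_statistic(t, 1.0 - alpha))
        lambdas.append(lam_s)
    return order_statistic(lambdas, gamma)  # the gamma quantile over the S replications

lam = simulate_lambda([50, 50, 50], alpha=0.05, gamma=0.95, S=200, Q=500)
print(lam)
```

With S and Q this small the inner quantile estimates are noisy, so the result should not be expected to match a value computed with S = Q = 10,000; increasing both brings the sketch closer to the procedure described above at the cost of run time.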
It is noteworthy that λ depends only on γ, α, p, k, n 1 , . . . , n k (and the numbers of simulations S and Q which determine the numerical accuracy of λ due to simulation randomness). One can download from [40] our R computer program ConfidenceSetClassifier.R that implements this simulation method of computing the critical constant λ. While it is expected that larger values of S and Q will produce a more accurate λ value, it must be pointed out that there is no easy way to assess how the accuracy of λ depends on the values of S and Q. One practical way is to compute several λ values using different random seeds in the simulation for given S and Q, which form a random sample from the population of possible λ values. These λ values provide information on the variability among the possible λ values produced by the simulation method, and so accuracy of λ due to simulation randomness. See more details in Section 5.
As in Section 3, the confidence set $C_T(y)$ in Equation (4) may be empty for a y and so y is classified as not belonging to any of the k classes. As discussed in Section 3, there is nothing wrong with this, but it is sensible to classify such a y according to any reasonable classifier. The resultant confidence set $C^*_T(y)$ from this augmentation contains $C_T(y)$, and so Inequality (9) still holds for $C^*_T(y)$.

An Illustrative Example
The famous iris data set introduced by [41] is used in this section to illustrate the method proposed in this paper. The data set is simple but serves the purpose of illustration nevertheless. It contains k = 3 classes representing the three species/classes of Iris flowers (1 = setosa, 2 = versicolor, 3 = virginica), and has n i = 50 observations from each class in T . Each observation gives the measurements (in centimetres) of the four variables: sepal length and width, and petal length and width. The data set iris can be found in ( [42], Chapter 10) for example, and is also in the R base package.
First, we assume that only the first two measurements, sepal length and width, are used for classification in order to easily illustrate the method, since the acceptance sets $A_l$, l = 1, 2, 3, are two-dimensional and so can be easily plotted in this case. Based on the fifty observations on p = 2 measurements from each of the three classes, one can calculate the estimates $\hat\mu_l$ and $\hat\Sigma_l$, l = 1, 2, 3. In the example in Section 3, these are used as the known values of $\mu_i$ and $\Sigma_i$ for the three classes. For α = 5% and γ = 95%, the critical constant λ in Equation (7) is computed by our R program to be 9.175 using S = 10,000 and Q = 10,000. The confidence set $C_T(y)$ in Equation (4) is based on the acceptance sets $A_l = \{ y \in \mathbb{R}^p : (y - \hat\mu_l)^T \hat\Sigma_l^{-1} (y - \hat\mu_l) \le \lambda \}$, l = 1, 2, 3, which are plotted in Figure 2 as the ellipsoidal regions centred at $\hat\mu_l$, marked by '+', l = 1, 2, 3. These ellipsoidal regions are larger than, but have the same centres and shapes as, the corresponding ellipsoidal regions given in Figure 1 of Section 3. This reflects the fact that the underlying multivariate normal distributions have been estimated from the training data T in this case and so involve uncertainty, while the distributions in Section 3 are assumed to be known. The index l is an element of the confidence set $C_T(y)$ in Equation (4) if and only if $y \in A_l$. Hence, the following four situations can occur, similar to those in Section 3. (a) y falls into only one $A_l$ and so $C_T(y)$ has only one class. (b) y falls into two $A_l$'s but not the other one, and so $C_T(y)$ contains two classes. (c) y falls into all three $A_l$'s, i.e., $y \in A_1 \cap A_2 \cap A_3$, and so y is classified as belonging to possibly all three classes. (d) y falls outside all the $A_l$'s, i.e., $y \in A_1^c \cap A_2^c \cap A_3^c$, and so $C_T(y) = \emptyset$ and y is classified as not belonging to any one of the three classes.
From Figure 2, it is clear that $A_1 \cap A_2 \cap A_3 \ne \emptyset$ and so, for any future $y \in A_1 \cap A_2 \cap A_3$, the confidence set is $C_T(y) = \{1, 2, 3\}$, that is, y is judged to be possibly from any of the three classes.
As in Section 3, if y does not belong to any $A_l$, l = 1, 2, 3, we compute the augmented confidence set $C^*_T(y)$ by using, for example, the naive Bayesian classifier with a non-informative prior that classifies y to the class $i_0$ that satisfies $\hat f(y, i_0) = \max_{l \in C} \hat f(y, l)$, where $\hat f(\cdot, l)$ is the multivariate normal density function of the lth class with $\mu_l$ and $\Sigma_l$ replaced by the estimates $\hat\mu_l$ and $\hat\Sigma_l$, respectively.
To get some idea of how sensitive the critical constant λ is to the simulation numbers S and Q, we have computed λ for various (S, Q) with γ = 0.95, α = 0.05, p = 2, k = 3 and $n_1 = n_2 = n_3 = 50$ on an ordinary Windows PC (Core(TM)2 Duo CPU P8400 @ 2.26 GHz). Since larger values of S and Q are expected to produce a more accurate λ value, the results given in Table 1 indicate that the λ value based on (S, Q) = (10,000, 10,000), in comparison with the λ value based on (S, Q) = (20,000, 20,000), is accurate to at least the first decimal place and so probably sufficiently accurate for most real problems. Alternatively, one can compute several λ values for the given S and Q values using different random seeds to assess the accuracy of a computed λ value. For example, fourteen λ values based on (S, Q) = (10,000, 10,000) and fourteen different random seeds are computed to be: 9.231, 9.188, 9.172, 9.223, 9.192, 9.178, 9.203, 9.191, 9.198, 9.225, 9.182, 9.189, 9.224, 9.181, which form a sample of observations from the population distribution of all possible values of λ. This sample can then be used to infer the population and, in particular, the standard deviation of the population, which gives the variability (or accuracy) of one λ value from the population. The mean and standard deviation of this sample of fourteen observations are given by 9.198 and 0.0196, respectively, and so the λ value based on (S, Q) = (10,000, 10,000) is expected to be within the range 9.198 ± 3 × 0.0196 using the "three-sigma" rule.
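The seed-to-seed accuracy check can be reproduced from the fourteen reported values. The short sketch below uses the sample standard deviation (divisor n − 1) from Python's standard library; the variable names are illustrative, and the results should land close to the mean and standard deviation reported above, up to rounding.

```python
import statistics

# The fourteen critical constants reported for (S, Q) = (10,000, 10,000).
lams = [9.231, 9.188, 9.172, 9.223, 9.192, 9.178, 9.203,
        9.191, 9.198, 9.225, 9.182, 9.189, 9.224, 9.181]

mean = statistics.mean(lams)
sd = statistics.stdev(lams)                 # sample standard deviation (n - 1 divisor)
low, high = mean - 3.0 * sd, mean + 3.0 * sd  # "three-sigma" range

print(round(mean, 3), round(sd, 4), round(low, 3), round(high, 3))
```

All fourteen values fall inside the three-sigma range, which is consistent with treating the range as a rough bound on the simulation error of a single computed value.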
It is also worth emphasizing that only one λ needs to be computed based on the observed training dataset T which is then used for classifications of all future objects. Hence, one can always increase S and Q to achieve better accuracy of λ as required and computation time should not be of a great concern.

A Simulation Study
In this section, a simulation study is carried out to illustrate the desirable feature of the confidence-set based classifier (CS) proposed in this paper, and to highlight its differences from the following popular classifiers: classification tree (CT, implemented using R package tree), multinomial logistic regression (MLR, implemented using R package nnet), support vector machine (SVM, implemented using R package e1071) and naive Bayes (NB, implemented using R package e1071). The setting k = 3, p = 2, n 1 = n 2 = n 3 = 50, γ = 0.95 and α = 0.05 is considered following the illustrative example in the last section.
For each configuration of the three population distributions, a random sample of size $n_i = 50$ is generated from each class/distribution to form the training data set T, which is then used to train the classifiers CS, CT, MLR, SVM and NB. Each classifier is then used to classify N = 3000 future objects, with 1000 generated from each of the three classes/distributions; the proportion of correct classification, ζ, of the N = 3000 objects is recorded. For CS, the average size M of the confidence sets for the N = 3000 objects is also recorded; note that all the other classifiers classify each future object to only one class. This process is repeated 100 times to produce $\zeta_1, \ldots, \zeta_{100}$ for each classifier, and $M_1, \ldots, M_{100}$ for CS only.
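The quantities recorded in each repetition can be computed as in the sketch below, which is an illustration and not the authors' SimulationStudyF.R program. A single-class classifier is represented by one-element sets. The last helper, the proportion of repetitions whose ζ reaches a target, is an assumed summary added for illustration.

```python
def summarize(true_classes, predicted_sets):
    """Return (zeta, M): proportion of sets containing the true class, and average set size."""
    n = len(true_classes)
    zeta = sum(c in s for c, s in zip(true_classes, predicted_sets)) / n
    avg_size = sum(len(s) for s in predicted_sets) / n
    return zeta, avg_size

def proportion_at_least(zetas, target):
    """Proportion of repetitions whose zeta is at least the target (assumed summary)."""
    return sum(z >= target for z in zetas) / len(zetas)

# Toy input: four future objects with their true classes and the sets assigned to them.
true_classes = [1, 2, 3, 1]
predicted_sets = [{1}, {2, 3}, {1}, {1, 2, 3}]
zeta, avg_size = summarize(true_classes, predicted_sets)
share = proportion_at_least([0.96, 0.94, 0.95], 0.95)
print(zeta, avg_size, share)
```

In this toy input the third object is the only one whose set misses the true class, and the set sizes are 1, 2, 1 and 3.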
The results on $\hat\gamma$, $\bar\zeta$ and $\bar M$ are given in Table 2, with the corresponding standard deviations given in brackets; here $\bar\zeta$ and $\bar M$ denote the averages of the 100 values of $\zeta_i$ and $M_i$, respectively, and $\hat\gamma$ the proportion of the 100 values of $\zeta_i$ that are at least 1 − α. One can download from [40] our R computer program SimulationStudyF.R that implements this simulation study. Due to the property in Inequality (9) of CS, one expects that $\hat\gamma \ge \gamma = 0.95$ for CS. This is indeed the case for each of the three configurations from the results in Table 2. Note, however, that $\hat\gamma$ is either equal or close to zero for all the other classifiers. This is the advantage of CS, by construction, over the other classifiers. To guarantee the property in Inequality (9), the size of the confidence set may be larger than one, as indicated by the $\bar M$ values in Table 2, while all the other classifiers select only one class for each future object. The average size of the confidence set depends on the configuration of the k = 3 classes. As expected, $\bar M$ tends to be smaller when the k = 3 classes are easier to distinguish, but larger when the k = 3 classes are harder to distinguish. For example, CONF3 has a considerably smaller $\bar M$ than CONF1.
As CS has the property in Inequality (9), it is not surprising that $\bar\zeta$ is likely to be larger than 1 − α = 0.95, which is borne out by the results in Table 2. However, for the other classifiers, the value of $\bar\zeta$ depends on how different the k = 3 classes are; $\bar\zeta$ tends to be larger when the k = 3 classes are more different and thus easier to distinguish. For example, CONF3 has a larger $\bar\zeta$ than CONF1.

Conclusions
This paper considers how to deal with the classification problem using the novel confidence set approach by adapting the idea of [19,20] for inference about the predictor values of the observed response values in a standard linear regression model. Specifically, confidence sets C T (y j ) for the true classes c j of infinitely many future objects y j (j = 1, 2, . . .), based on one training data set T , have been constructed so that, with confidence level γ about the randomness in T , the proportion of the C T (y j )'s that contain the true c j 's is at least 1 − α.
The intuitive motivation underlying this method is that, when an object is judged to be possibly from several classes, we should accept this objectively rather than forcing ourselves to pick just one class, which entails a large chance of misclassification. By allowing an object to be classified as possibly from more than one class, the proportion of correct classification can be guaranteed to be at least 1 − α with a large probability γ about the randomness in the training data set T. This 'guaranteed probability γ about the randomness in T' should be intuitive too, since a T that is very misleading about the k classes will likely produce a classifier that makes many wrong classifications, and so only a proportion γ of well-behaved T will produce a classifier that gives at least a proportion 1 − α of correct future classifications.
The two sources of randomness, that in the training data $T$ and that in the future objects $y_j$, have been treated differently to reflect the fact that a classifier is built from one training data set $T$ and then used to classify many future objects $y_j$. If the two sources of randomness are treated on an equal footing, then the confidence set in Equation (10) should be used, which has a very different coverage frequency interpretation.
In this paper, the objects $y$ from each class are assumed to follow a multivariate normal distribution. How the proposed method can be generalized to, or may be affected by, non-normal distributions, such as the elliptically contoured distributions [38] (p. 47), is an interesting question that warrants further research.
A frequentist approach is proposed in this paper. One wonders whether a corresponding Bayesian approach is easier to construct. In a Bayesian approach, one uses the posterior distribution $\pi(c_j \mid y_j, T)$ to make inference about the true class $c_j$ of the future object $y_j$. In particular, one can easily construct a Bayesian credible set $C_B(y_j)$ for $c_j$ such that $P\{c_j \in C_B(y_j) \mid y_j, T\} \ge 1 - \alpha$. However, it is not at all clear whether this construction guarantees that $\liminf_{N\to\infty} \frac{1}{N}\sum_{j=1}^{N} I_{\{c_j \in C_B(y_j)\}} \ge 1 - \alpha$, since it can be shown that $\pi(c_i, c_j \mid y_i, y_j, T) \ne \pi(c_i \mid y_i, T)\,\pi(c_j \mid y_j, T)$, i.e., the posterior distributions of $c_i$ and $c_j$ for the two future objects $y_i$ and $y_j$ are not independent. Nevertheless, a Bayesian approach warrants further research.
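One common way to build such a credible set can be sketched as follows. The paper does not specify a construction, so this is an assumption: classes are added in decreasing order of posterior probability until the accumulated mass reaches $1 - \alpha$. The function name `credible_set` and the dictionary input are hypothetical, and the posterior probabilities are taken as given rather than computed from a model.

```python
def credible_set(posterior, alpha=0.05):
    """Smallest set of classes whose posterior mass is at least 1 - alpha.

    `posterior` maps a class label to its posterior probability pi(c | y, T).
    This highest-posterior construction is an illustrative assumption,
    not a construction taken from the paper.
    """
    total = sum(posterior.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    # Visit classes from most probable to least probable.
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for label, prob in ranked:
        chosen.append(label)
        mass += prob
        # Small tolerance guards against floating-point round-off.
        if mass >= 1.0 - alpha - 1e-12:
            break
    return chosen, mass


# A clear-cut object yields a single class; an ambiguous one yields two.
print(credible_set({1: 0.97, 2: 0.02, 3: 0.01})[0])
print(credible_set({1: 0.60, 2: 0.37, 3: 0.03})[0])
```

Such a set satisfies the per-object posterior statement by construction; whether it also delivers the long-run proportion of correct classifications is exactly the open question raised above.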
where the first equality above follows from the classical strong law of large numbers ([43], p. 333), the second from the definition of $C(y_j)$ in Equation (2), and the third from $\lambda = \chi^2_{p,1-\alpha}$. This completes the proof.
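The known-parameter result can be checked numerically with a minimal sketch under simplifying assumptions that are not taken from the paper: $p = 2$, identity covariance matrices, three hypothetical class means, and a confidence set consisting of all classes whose squared Mahalanobis distance from $y$ is at most $\lambda$ (our reading of Equation (2), which is not reproduced in this section). For $p = 2$ the chi-square quantile has the closed form $\chi^2_{2,1-\alpha} = -2\ln\alpha$, so only the standard library is needed.

```python
import math
import random


def confidence_set(y, means, lam):
    """Classes whose squared distance from y is at most lam (identity covariance)."""
    return [l for l, mu in enumerate(means)
            if (y[0] - mu[0]) ** 2 + (y[1] - mu[1]) ** 2 <= lam]


def simulate(n_future=100_000, alpha=0.05, seed=1):
    """Return (proportion of sets containing the true class, average set size)."""
    rng = random.Random(seed)
    means = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]  # hypothetical class means
    lam = -2.0 * math.log(alpha)  # chi-square (1 - alpha) quantile for p = 2
    covered = 0
    total_size = 0
    for _ in range(n_future):
        c = rng.randrange(len(means))  # true class of the future object
        y = (rng.gauss(means[c][0], 1.0), rng.gauss(means[c][1], 1.0))
        cs = confidence_set(y, means, lam)
        covered += c in cs
        total_size += len(cs)
    return covered / n_future, total_size / n_future


coverage, mean_size = simulate()
print(coverage, mean_size)
```

With these settings the observed proportion of correct classifications should be close to $1 - \alpha = 0.95$, while the average set size exceeds one because the hypothetical classes overlap.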
Next, we show that $\inf_{c_j \in C} E_{y_j|T}\, I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha$ implies $\liminf_{N\to\infty} \frac{1}{N}\sum_{j=1}^{N} I_{\{c_j \in C_T(y_j)\}} \ge 1 - \alpha$, where $E_{y_j|T}$ denotes the conditional expectation with respect to the random variable $y_j$, conditioning on the training data set $T$ (or, equivalently, on all the $\hat{\mu}_i$ and $\hat{\Sigma}_i$). We have from the classical strong law of large numbers [43] that for any $N \ge 1$ since it is known that all the $c_j$'s are in $C$. This completes the proof.
Next, we provide a more tractable expression for $\inf_{c_j \in C} E_{y_j|T}\, I_{\{c_j \in C_T(y_j)\}}$ in order to understand why Inequality (6) cannot be guaranteed for each observed $T$. From the definition of $C_T(y)$ in Equation (4), we have $\inf_{c_j \in C} E_{y_j|T}\, I_{\{c_j \in C_T(y_j)\}} = \inf_{c_j \in C} P_{y_j|T}\{c_j \in C_T(y_j)\} = \inf_{c_j \in C} P_{y_j|T}\{(y_j - \hat{\mu}_{c_j})^T \hat{\Sigma}_{c_j}^{-1} (y_j - \hat{\mu}_{c_j}) \le \lambda\}$ with all the $w_l$'s, $u_l$'s and $v_{lm}$'s being independent. Note that $w_l$ depends on the future observation $y_l$ but not on the training data set $T$, while $u_l$ and $\{v_{lm}\}$ depend on the training data set $T$ but not on the future observations.
Since the conditional probability in Equation (13) depends on the training data set $T$ (via the random vectors $u_l$ and $\{v_{lm}\}$), Inequality (6), for any given value of $\lambda$, cannot be guaranteed for each observed training data set $T$, i.e., for each $u_l$ and $\{v_{lm}\}$. For example, if the values of $T$ are such that the quadratic form in $w_l - u_l$ in Equation (13) is substantially larger than $\lambda$ (for a given constant $\lambda$) for most possible values of $w_l \sim N(0, I_p)$, then the conditional probability in Equation (13) is smaller than 1/2 and hence smaller than $1 - \alpha \in (1/2, 1)$.
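The dependence of the conditional coverage on $T$ can be illustrated with a minimal sketch under strong simplifying assumptions that are not taken from the paper: a single class, $p = 1$, true distribution $N(0, 1)$, and the plug-in constant $\lambda = \chi^2_{1,1-\alpha}$ rather than the calibrated $\lambda$ of the proposed method. In this univariate case the conditional coverage given $T$ is available exactly from the normal distribution function, so no inner simulation is needed. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def conditional_coverage(train, lam, mu=0.0, sigma=1.0):
    """Exact P(|y - mu_hat| <= sqrt(lam) * sigma_hat | T) for y ~ N(mu, sigma^2)."""
    mu_hat, sigma_hat = mean(train), stdev(train)  # estimates from T
    half = math.sqrt(lam) * sigma_hat
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu_hat + half) - dist.cdf(mu_hat - half)


def fraction_meeting_target(n_train=20, n_sets=2000, alpha=0.05, seed=1):
    """Share of training sets whose conditional coverage is at least 1 - alpha."""
    rng = random.Random(seed)
    # Chi-square (1 - alpha) quantile for p = 1 is a squared normal quantile.
    lam = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    coverages = [
        conditional_coverage([rng.gauss(0.0, 1.0) for _ in range(n_train)], lam)
        for _ in range(n_sets)
    ]
    ok = sum(c >= 1 - alpha for c in coverages) / n_sets
    return ok, min(coverages), max(coverages)


ok, lowest, highest = fraction_meeting_target()
print(ok, lowest, highest)
```

In this toy setting the conditional coverage falls on both sides of $1 - \alpha$ depending on the training set drawn, and only a minority of training sets reach the target with the plug-in $\lambda$. This is why the guarantee is stated with probability $\gamma$ over $T$ and why $\lambda$ has to be chosen larger than the chi-square quantile.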
We therefore guarantee Inequality (6) with a large (close to 1) probability $\gamma$ with respect to the randomness in $T$, which is, from Equation (13), clearly equivalent to