Article

Confidence Sets for Statistical Classification (II): Exact Confidence Sets

Wei Liu 1,*, Frank Bretz 2 and Anthony J. Hayter 3
1 S3RI, School of Mathematics, University of Southampton, Southampton SO17 1BJ, UK
2 Novartis Pharma AG, 4002 Basel, Switzerland
3 Department of Statistics and Operations Technology, University of Denver, Denver, CO 80208, USA
* Author to whom correspondence should be addressed.
Stats 2019, 2(4), 439-446; https://doi.org/10.3390/stats2040030
Submission received: 4 October 2019 / Revised: 18 October 2019 / Accepted: 1 November 2019 / Published: 7 November 2019

Abstract

Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. Liu et al. (2019) proposed a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. However, the classifier uses a conservative critical constant. In this paper, we show how to determine the exact critical constant in applications where prior knowledge about the proportions of the future objects from each class is available. As the exact critical constant is smaller than the conservative critical constant given by Liu et al. (2019), the classifier using the exact critical constant is better than the classifier by Liu et al. (2019) as expected. An example is provided to illustrate the method.

1. Introduction

Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. For overviews, the reader is referred to the books [1,2,3,4,5]. In a recent paper, Liu et al. (2019) [6] proposed a new classifier based on confidence sets. It constructs a confidence set for the unknown parameter c, the true class of each future object, and classifies the object as belonging to the set of classes given by the confidence set. Hence, this approach classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects, with a pre-specified confidence γ with respect to the randomness in the training data on which the classifier is constructed.
However, the classifier of Liu et al. (2019) uses a conservative critical constant λ and so the resultant confidence sets may be larger than necessary. The purpose of this paper is to determine the exact critical constant λ and therefore to improve the classifier of Liu et al. (2019) in situations where one has prior knowledge about the proportions of the (infinite) future objects belonging to the k possible classes.
The layout of the paper is as follows. Section 2 gives a very brief review of the classifier of Liu et al. (2019) and then considers the determination of the exact critical constant λ under the additional knowledge/assumption given above. An illustrative example is given in Section 3 to demonstrate the advantage of the improved classifier proposed in this paper when the additional assumption holds. Section 4 contains conclusions and discussion. Finally, some mathematical details are provided in Appendix A. As the same setting and notation as in the work by Liu et al. (2019) are used, it is recommended to read this paper in conjunction with that one.

2. Methodology

2.1. Review of the Classifier and the Exact Critical Constant λ

Let the p-dimensional data vector $x_l = (x_{l1}, \ldots, x_{lp})^T$ denote the feature measurement on an object from the lth class, which has the multivariate normal distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$; here, k denotes the total number of classes, which is a known number. The available training dataset is given by $\mathcal{T} = \{x_{l1}, \ldots, x_{l n_l};\ l = 1, \ldots, k\}$, where $x_{l1}, \ldots, x_{l n_l}$ are i.i.d. observations from the lth class with distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The classification problem is to make inference about c, the true class of a future object, based on the feature measurement $y = (y_1, \ldots, y_p)^T$ observed on the object, which is only known to belong to one of the k classes and so follows one of the k multivariate normal distributions. In statistical terminology, c is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \ldots, k\}$. We emphasize that c is treated as non-random in both the work of Liu et al. (2019) and here.
A classifier that classifies an object with measurement y into one single class in $C = \{1, \ldots, k\}$ can be regarded as a point estimator of c. The classifier of Liu et al. (2019) provides a set $C_{\mathcal{T}}(y) \subseteq C$ as plausible values of c. Depending on y and the training dataset $\mathcal{T}$, $C_{\mathcal{T}}(y)$ may contain only a single value, in which case y is classified into the one single class given by $C_{\mathcal{T}}(y)$. When $C_{\mathcal{T}}(y)$ contains more than one value in C, y is classified as possibly belonging to the several classes given by $C_{\mathcal{T}}(y)$. Hence, in statistical terms, the classifier uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed $1 - \alpha$ proportion of confidence sets that contain the true classes.
Specifically, the set $C_{\mathcal{T}}(y) \subseteq C$ was constructed by Liu et al. (2019) as
C_{\mathcal{T}}(y) = \left\{ l \in C : (y - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda \right\},   (1)
where $\hat{\mu}_l = \frac{1}{n_l} \sum_{m=1}^{n_l} x_{lm}$ and $\hat{\Sigma}_l = \frac{1}{n_l - 1} \sum_{m=1}^{n_l} (x_{lm} - \hat{\mu}_l)(x_{lm} - \hat{\mu}_l)^T$, $l = 1, \ldots, k$, are, respectively, the usual estimators of the unknown $\mu_l$ and $\Sigma_l$ based on the training dataset $\mathcal{T} = \{x_{l1}, \ldots, x_{l n_l};\ l = 1, \ldots, k\}$, and λ is a suitably chosen critical constant whose determination is considered next. The intuition behind the definition of $C_{\mathcal{T}}(y)$ in Equation (1) is that a future object y is likely to be from class l if and only if $(y - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda$.
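The construction in Equation (1) is direct to code once λ is available. The following is a minimal R sketch of this step; it is not the authors' released program, and the function and object names (fit_classes, confidence_set, training) are ours and purely illustrative.

fit_classes <- function(training) {
  ## 'training' is assumed to be a list of k numeric matrices (or data frames),
  ## one per class, each with n_l rows (observations) and p columns (features).
  lapply(training, function(X)
    list(mu = colMeans(X), Sigma = cov(X)))   # usual estimators of mu_l and Sigma_l
}

confidence_set <- function(y, fits, lambda) {
  ## squared Mahalanobis-type distance of y to each estimated class centre
  d2 <- sapply(fits, function(f) mahalanobis(y, center = f$mu, cov = f$Sigma))
  which(d2 <= lambda)   # classes l with (y - mu_hat_l)' Sigma_hat_l^{-1} (y - mu_hat_l) <= lambda
}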
Note that the proportion of the future confidence sets $C_{\mathcal{T}}(y_j)$ $(j = 1, 2, \ldots)$ that include the true classes $c_j$ of the $y_j$ $(j = 1, 2, \ldots)$ is given by $\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \}$. Thus, it is desirable that
\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \} \ge 1 - \alpha,   (2)
where $1 - \alpha$ is a pre-specified large (close to 1) proportion, e.g., 0.95. While the constraint in Equation (2) is difficult to deal with, Liu et al. (2019) showed that a sufficient condition for guaranteeing Equation (2) is
\inf_{c_j \in C} E_{y_j \mid \mathcal{T}} \, I\{ c_j \in C_{\mathcal{T}}(y_j) \} \ge 1 - \alpha,   (3)
where $E_{y_j \mid \mathcal{T}}$ denotes the conditional expectation with respect to the random variable $y_j$ conditional on the training dataset $\mathcal{T}$ (or, equivalently, on $\{(\hat{\mu}_1, \hat{\Sigma}_1), \ldots, (\hat{\mu}_k, \hat{\Sigma}_k)\}$).
Since the value of the expression on the left-hand side of the inequality in Equation (3) (and in Equation (2) as well) depends on $\mathcal{T}$, and $\mathcal{T}$ is random, the inequality in Equation (3) cannot be guaranteed for each observed $\mathcal{T}$. We therefore guarantee Equation (3) with a large (close to 1) probability γ with respect to the randomness in $\mathcal{T}$:
P_{\mathcal{T}} \left\{ \inf_{c_j \in C} E_{y_j \mid \mathcal{T}} \, I\{ c_j \in C_{\mathcal{T}}(y_j) \} \ge 1 - \alpha \right\} = \gamma,   (4)
which in turn guarantees that
P_{\mathcal{T}} \left\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \} \ge 1 - \alpha \right\} \ge \gamma.   (5)
Computer code in R was provided by Liu et al. (2019) to compute the λ that solves Equation (4), which allows the confidence sets $C_{\mathcal{T}}(y_j)$ in Equation (1) to be constructed for each future object.
The interpretation of Equations (5) and (6) below is that, based on one observed training dataset $\mathcal{T}$, one constructs confidence sets $C_{\mathcal{T}}(y_j)$ for the $c_j$'s of all future $y_j$ $(j = 1, 2, \ldots)$ and claims that at least a $1 - \alpha$ proportion of these confidence sets do contain the true $c_j$'s. Then, we are γ confident, with respect to the randomness in the training dataset $\mathcal{T}$, that the claim is correct.
A natural question is how to find the exact critical constant λ that solves the equation
P_{\mathcal{T}} \left\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \} \ge 1 - \alpha \right\} = \gamma,   (6)
which is an improvement over the conservative λ that solves Equation (4) given by Liu et al. (2019). Next, we show how to find the exact critical constant λ under an additional assumption that is satisfied in some applications.
Assume that, among the N future objects that need to be classified, $N_l$ objects are actually from the lth class with the distribution $N(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The additional assumption we make is that
\lim_{N \to \infty} \frac{N_l}{N} = r_l, \quad l = 1, \ldots, k,   (7)
where the $r_l$'s are assumed to be known constants in the interval $[0, 1]$. Intuitively, this assumption means that we know the proportions of the future objects that belong to each of the k classes, even though we do not know the true class of each individual future object.
The assumption in Equation (7) is reasonable in some applications. For example, when screening for a particular disease in a specific population for preventive purposes, there are k = 2 classes: having the disease (l = 1) or not having the disease (l = 2). If we know the prevalence of the disease, d, in the overall population, then $r_1 = d$ and $r_2 = 1 - d$, even though we do not know whether an individual subject has the disease or not.
It is shown in Appendix A that, under the assumption in Equation (7), Equation (6) is equivalent to
P_{u_l, \{v_{lm}\}} \left\{ \sum_{l=1}^{k} r_l \, P_{w_l \mid u_l, \{v_{lm}\}} \left\{ (w_l - u_l)^T \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^T \right)^{-1} (w_l - u_l) \le \lambda \right\} \ge 1 - \alpha \right\} = \gamma,   (8)
where
w_l \sim N(0, I_p), \quad u_l \sim N(0, I_p / n_l), \quad v_{lm} \sim N(0, I_p), \quad m = 1, \ldots, n_l - 1,   (9)
and all the $w_l$'s, $u_l$'s and $v_{lm}$'s are independent; $P_{w_l \mid u_l, \{v_{lm}\}}\{\cdot\}$ denotes the conditional probability about $w_l$ conditional on $(u_l, \{v_{lm}\})$, and $P_{u_l, \{v_{lm}\}}\{\cdot\}$ denotes the probability about the $(u_l, \{v_{lm}\})$.
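Because the quantities in Equations (8) and (9) involve only standard multivariate normal pivots, the inner conditional probability can be approximated by simulation without knowing $\mu_l$ or $\Sigma_l$. The following R sketch of this building block is ours (the names draw_pivots and cond_prob are illustrative, not from the released program); the algorithm in Section 2.2 uses it repeatedly.

## Draw, for one class with sample size n_l, the pivots of Equation (9):
## u ~ N(0, I_p / n_l) and v_1, ..., v_{n_l - 1} ~ N(0, I_p), and form
## S = (1/(n_l - 1)) * sum_m v_m v_m'.
draw_pivots <- function(p, n_l) {
  u <- rnorm(p) / sqrt(n_l)
  V <- matrix(rnorm((n_l - 1) * p), nrow = n_l - 1, ncol = p)  # rows are the v_m's
  list(u = u, S = crossprod(V) / (n_l - 1))
}

## Monte Carlo approximation of the conditional probability in Equation (8):
## the proportion of rows w of W (draws from N(0, I_p)) with
## (w - u)' S^{-1} (w - u) <= lambda.
cond_prob <- function(lambda, piv, W) {
  mean(mahalanobis(W, center = piv$u, cov = piv$S) <= lambda)
}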

2.2. Algorithm for Computing the Exact λ

We now consider how to compute the critical constant λ that solves Equation (8). Similar to Liu et al. (2019), this is accomplished by simulation in the following way. From the distributions given in Equation (9), in the sth repeat of the simulation, $s = 1, \ldots, S$, generate independent
u_l^s \sim N(0, I_p / n_l), \quad v_{l1}^s, \ldots, v_{l(n_l - 1)}^s \sim N(0, I_p); \quad l = 1, \ldots, k,
and find the $\lambda = \lambda_s$ so that
\sum_{l=1}^{k} r_l \, P_{w_l \mid u_l^s, \{v_{lm}^s\}} \left\{ (w_l - u_l^s)^T \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm}^s (v_{lm}^s)^T \right)^{-1} (w_l - u_l^s) \le \lambda_s \right\} = 1 - \alpha.   (10)
Repeat this S times to get $\lambda_1, \ldots, \lambda_S$ and order these as $\lambda_{[1]} \le \cdots \le \lambda_{[S]}$. It is well known (cf. [7]) that $\lambda_{[\gamma S]}$ converges to the required critical constant λ with probability one as $S \to \infty$. Hence, $\lambda_{[\gamma S]}$ is used as the required critical constant λ for a large S value, e.g., 10,000.
To find the $\lambda_s$ in Equation (10) for each s, we again use simulation. Generate independent random vectors $\{w_{lq} : q = 1, \ldots, Q;\ l = 1, \ldots, k\}$ from $N(0, I_p)$, where Q is the number of simulations for finding $\lambda_s$. For each given value of $\lambda_s > 0$, the expression on the left-hand side of Equation (10) can be computed by approximating each of the k probabilities involved by the corresponding proportion out of the Q simulations. This expression is clearly monotone increasing in $\lambda_s$, so the $\lambda_s$ that solves Equation (10) can be found by a search algorithm; for example, the bisection method is used in our R code. To approximate the probabilities reasonably accurately by the proportions, a large Q value, e.g., 10,000, should be used.
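Putting the two steps together, a compact and purely illustrative R version of the whole procedure might look as follows; it reuses draw_pivots() and cond_prob() sketched in Section 2.1 and is not the released program ExactConfidenceSetClassifier.R. For simplicity, the same Q draws of w are reused for every class and every repeat, and R's uniroot() stands in for the bisection search described above.

exact_lambda <- function(p, n, r, alpha = 0.05, gamma = 0.95,
                         S = 10000, Q = 10000, upper = 100) {
  ## n and r are vectors of length k with the class sample sizes n_l and the
  ## assumed proportions r_l of Equation (7) (the r_l should sum to 1);
  ## 'upper' bounds the root search and may need enlarging for large p.
  W   <- matrix(rnorm(Q * p), nrow = Q, ncol = p)   # Q draws of w ~ N(0, I_p)
  lam <- numeric(S)
  for (s in seq_len(S)) {
    piv <- lapply(n, function(nl) draw_pivots(p, nl))
    lhs <- function(lambda)                          # left-hand side of Equation (10)
      sum(r * sapply(piv, function(pv) cond_prob(lambda, pv, W)))
    lam[s] <- uniroot(function(lambda) lhs(lambda) - (1 - alpha),
                      lower = 0, upper = upper)$root
  }
  sort(lam)[ceiling(gamma * S)]                      # lambda_{[gamma * S]}
}

A call such as exact_lambda(p = 2, n = c(50, 50, 50), r = c(0.3, 0.4, 0.3)) would then mirror the iris example of Section 3, up to simulation error.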
It is noteworthy from Equations (8) and (9) that λ depends only on $\gamma, \alpha, p, k, n_1, \ldots, n_k, r_1, \ldots, r_k$ (and on the numbers of simulations S and Q, which determine the numerical accuracy of λ due to simulation randomness). It is also worth emphasizing that only one λ needs to be computed, which is then used, together with the observed training dataset $\mathcal{T}$, for constructing the confidence sets $C_{\mathcal{T}}(y_j)$ and so classifying all future objects.
Larger values of S and Q are expected to produce a more accurate λ value; one can use the method discussed by Liu et al. (2019) to assess how the accuracy of λ depends on the values of S and Q. As in the work by Liu et al. (2019), it is recommended to set S = 10,000 and Q = 10,000 for a reasonable balance between computation time and the accuracy of λ subject to simulation randomness.
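One simple, if crude, way to gauge this simulation error (our suggestion, not necessarily the method of Liu et al. (2019)) is to recompute λ a few times with different random seeds and inspect the spread; the sketch below uses the exact_lambda() function above with smaller S and Q than recommended, purely to keep the check quick.

## Recompute lambda with several seeds and report the Monte Carlo spread.
lams <- sapply(1:5, function(seed) {
  set.seed(seed)
  exact_lambda(p = 2, n = c(50, 50, 50), r = c(0.3, 0.4, 0.3),
               alpha = 0.05, gamma = 0.95, S = 2000, Q = 2000)
})
c(mean = mean(lams), sd = sd(lams))   # the sd reflects the accuracy for this (S, Q)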

3. An Illustrative Example

As in the work of Liu et al. (2019), the famous iris dataset introduced by Fisher (1936) [8] is used in this section to illustrate the method proposed in this paper. The dataset contains k = 3 classes representing three species of Iris flowers (1 = setosa; 2 = versicolor; and 3 = virginica) and has $n_l = 50$ observations from each class in $\mathcal{T}$. Each observation gives the measurements (in centimeters) of four variables: sepal length and width, and petal length and width.
To illustrate the method in an easily visualized way, we focus on the case where only the first two measurements, sepal length and width, are used for classification, since the acceptance sets $A_l = \{\, y \in R^p : (y - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda \,\}$, $l = 1, 2, 3$, are then two-dimensional and can be plotted easily. Based on the fifty observations on the p = 2 measurements from each of the three classes, the $\hat{\mu}_l$ and $\hat{\Sigma}_l$ were given by Liu et al. (2019).
For α = 5% and γ = 95%, the critical constant λ that solves Equation (4) was computed by Liu et al. (2019) to be $\lambda_{con} = 9.175$ using S = 10,000 and Q = 10,000. The corresponding acceptance sets, based on which the confidence set $C_{\mathcal{T}}(y)$ in Equation (1) can be constructed directly (cf. [6]), are given by
A_l^{con} = \{\, y \in R^p : (y - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda_{con} \,\}, \quad l = 1, 2, 3,
and are plotted in Figure 1 as the dotted ellipsoidal regions centered at the $\hat{\mu}_l$'s, which are marked by "+".
Now, assume that we know the proportions $(r_1, r_2, r_3)$ of the three species among all Iris flowers and that the Iris flowers to be classified reflect this composition. For the same α = 5%, γ = 95%, S = 10,000 and Q = 10,000 and with, for example, $(r_1, r_2, r_3) = (0.3, 0.4, 0.3)$, the exact critical constant λ that solves Equation (6) is computed by our R program to be $\lambda_{exa} = 7.737$. As expected, $\lambda_{exa}$ is smaller than $\lambda_{con}$ and, as a result, the corresponding confidence set $C_{\mathcal{T}}(y)$ in Equation (1) with $\lambda = \lambda_{exa}$ and the acceptance sets $A_l^{exa} = \{\, y \in R^p : (y - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y - \hat{\mu}_l) \le \lambda_{exa} \,\}$, $l = 1, 2, 3$, are also smaller than the $A_l^{con}$ given by Liu et al. (2019).
The acceptance sets $A_l^{exa}$, $l = 1, 2, 3$, are plotted in Figure 1 as the solid ellipsoidal regions. For example, if a future object has $y = (4.79, 2.35)$, marked by a solid dot in Figure 1, then the conservative confidence set of Liu et al. (2019) classifies the object as coming from Classes 2 and 3, since this y belongs to both $A_2^{con}$ and $A_3^{con}$. However, the new exact confidence set of this paper classifies the object as coming from Class 2 only, since this y belongs to $A_2^{exa}$ but not to $A_1^{exa}$ or $A_3^{exa}$. This demonstrates the advantage of the new confidence set using $\lambda_{exa}$ over the conservative confidence set using $\lambda_{con}$ of Liu et al. (2019). We have also computed the value of $\lambda_{exa}$ for several other given $(r_1, r_2, r_3)$. For example, $\lambda_{exa} = 7.706$ for $(r_1, r_2, r_3) = (1/3, 1/3, 1/3)$, $\lambda_{exa} = 7.865$ for $(r_1, r_2, r_3) = (0.1, 0.45, 0.45)$, and $\lambda_{exa} = 8.019$ for $(r_1, r_2, r_3) = (0.1, 0.7, 0.2)$. The conservative $\lambda_{con} = 9.175$ is considerably larger than these $\lambda_{exa}$ values, by about 14% to 19%.
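For readers who wish to reproduce this classification step, the following short R sketch applies both critical constants to the point y = (4.79, 2.35). It uses the illustrative fit_classes() and confidence_set() functions from Section 2 and the standard iris data shipped with R, rather than the authors' stored estimates.

## Fit the three classes from the first two iris measurements (p = 2)
training <- split(iris[, c("Sepal.Length", "Sepal.Width")], iris$Species)
fits     <- fit_classes(training)   # colMeans() and cov() accept these data frames

y <- c(4.79, 2.35)
confidence_set(y, fits, lambda = 9.175)   # conservative lambda_con: Classes 2 and 3
confidence_set(y, fits, lambda = 7.737)   # exact lambda_exa: Class 2 only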
One can download from http://www.personal.soton.ac.uk/wl/Classification/ the R program ExactConfidenceSetClassifier.R that implements this simulation method of computing the critical constant $\lambda_{exa}$. The computation of one $\lambda_{exa}$ using (S, Q) = (10,000, 10,000) takes about 13 h on an ordinary Windows PC with an Intel Core 2 Duo CPU.
However, it must be emphasized that the new confidence set is valid only if the assumption in Equation (7) is true. If the assumption does not hold, then the conservative confidence set of Liu et al. (2019) should be used in order for the statement in Equation (5) to hold.

4. Conclusions

The probability statement in Equation (5) gives the confidence sets of Liu et al. (2019) the appealing interpretation that, with confidence level γ with respect to the randomness in the training dataset $\mathcal{T}$, at least a $1 - \alpha$ proportion of the confidence sets $C_{\mathcal{T}}(y_j)$, $j = 1, 2, \ldots$, contain the true classes $c_j$, $j = 1, 2, \ldots$, of the future objects $y_j$, $j = 1, 2, \ldots$. However, the confidence set given by Liu et al. (2019) is conservative in that the λ in Equation (1) is computed to solve Equation (4), which implies the constraint in Equation (5). This paper considers how to compute the λ in Equation (1) so that the probability in Equation (5) is exactly equal to γ, i.e., from Equation (6). The confidence sets using the λ that solves Equation (6) have confidence level equal to γ and so are exact. We show that this can be accomplished under the extra assumption in Equation (7), which may be sensible in some applications.
As the $\lambda_{exa}$ that solves Equation (6) is smaller than the $\lambda_{con}$ that solves Equation (4) used by Liu et al. (2019), the new confidence sets are smaller and so better than the confidence sets given by Liu et al. (2019).
One wonders whether there are other sensible assumptions that allow the λ to be solved from Equation (6). This warrants further research.
If $C_{\mathcal{T}}(y)$ for a future object y is empty then, since y must be from one of the k classes, $C_{\mathcal{T}}(y)$ can be augmented to include the class that has the largest posterior probability under the naive Bayesian classifier, as in the work by Liu et al. (2019). The probability statement in Equation (5) clearly still holds under this augmentation, since $C_{\mathcal{T}}(y)$ is augmented only when it is empty.
There are applications in which the proportions $r_l$ are known only with uncertainty. For example, the training set may be a representative sample from the population, in which case the proportion of each class can be estimated, or the proportions may have been estimated from a previous independent auxiliary dataset. If one replaces the $r_l$'s in Equation (8) by these estimates, then the λ solved from Equation (8) depends on these estimates and so is random. As a result, the probability statement in Equation (5) is no longer valid. How to deal with such applications warrants further research.
Finally, the classifier of Liu et al. (2019) is developed from the idea of Lieberman et al. [9,10]. The same idea was also used by, for example, Mee et al. (1991) [11], Han et al. (2016) [12], Liu et al. (2016) [13] and Peng et al. (2019) [14], who all used conservative critical constants as did Liu et al. (2019). The idea of this paper can be applied to all these works to compute exact critical constants under suitable extra assumptions.

Author Contributions

Methodology, F.B., A.J.H. and W.L.; software, W.L.

Acknowledgments

We would like to thank the referees for critical and constructive comments on the earlier version of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Mathematical Details

In this appendix, we show the equivalence of Equations (6) and (8) under the assumption in Equation (7). Note first the well-known facts (cf. [15]) that $\hat{\mu}_l \sim N(\mu_l, \Sigma_l / n_l)$ and $(n_l - 1)\hat{\Sigma}_l = \sum_{m=1}^{n_l - 1} z_{lm} z_{lm}^T$, with $z_{l1}, \ldots, z_{l(n_l - 1)}$ being i.i.d. $N(0, \Sigma_l)$ random vectors independent of $\hat{\mu}_l$.
Among the N future objects that need to be classified, let $N_l$ be the number of objects actually from the lth class, with feature measurements denoted by $y_{l1}, \ldots, y_{l N_l}$, $l = 1, \ldots, k$. Clearly, we have $N_1 + \cdots + N_k = N$ and
\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \}
= \liminf_{N \to \infty} \frac{1}{N} \sum_{l=1}^{k} \sum_{i=1}^{N_l} I\{ c_l \in C_{\mathcal{T}}(y_{li}) \}
= \liminf_{N \to \infty} \sum_{l=1}^{k} \frac{N_l}{N} \left( \frac{1}{N_l} \sum_{i=1}^{N_l} I\{ c_l \in C_{\mathcal{T}}(y_{li}) \} \right).   (A1)
We have from the classical strong law of large numbers (cf. [16]) that
\lim_{N_l \to \infty} \left| \frac{1}{N_l} \sum_{i=1}^{N_l} I\{ c_l \in C_{\mathcal{T}}(y_{li}) \} - E_{y_{li} \mid \mathcal{T}} \, I\{ c_l \in C_{\mathcal{T}}(y_{li}) \} \right| = 0,   (A2)
in which the conditional expectation $E_{y_{li} \mid \mathcal{T}}$ is used since all the confidence sets $C_{\mathcal{T}}(y_{li})$ $(i = 1, \ldots, N_l)$ use the same training dataset $\mathcal{T}$. By noting that the $y_{li}$, $i = 1, \ldots, N_l$, are from the lth class and thus have the same distribution $N(\mu_l, \Sigma_l)$, we have from the definition of $C_{\mathcal{T}}(y)$ in Equation (1) that
E_{y_{li} \mid \mathcal{T}} \, I\{ c_l \in C_{\mathcal{T}}(y_{li}) \}
= P_{y_{l1} \mid \mathcal{T}} \{ c_l \in C_{\mathcal{T}}(y_{l1}) \}
= P_{y_{l1} \mid \mathcal{T}} \{ (y_{l1} - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y_{l1} - \hat{\mu}_l) \le \lambda \}
= P_{w_l \mid u_l, \{v_{lm}\}} \left\{ (w_l - u_l)^T \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^T \right)^{-1} (w_l - u_l) \le \lambda \right\},   (A3)
where
w_l = \Sigma_l^{-1/2} (y_{l1} - \mu_l) \sim N(0, I_p), \quad
u_l = \Sigma_l^{-1/2} (\hat{\mu}_l - \mu_l) \sim N(0, I_p / n_l), \quad
v_{lm} = \Sigma_l^{-1/2} z_{lm} \sim N(0, I_p), \quad m = 1, \ldots, n_l - 1,
with all the $w_l$'s, $u_l$'s and $v_{lm}$'s being independent. Note that $w_l$ depends on the future observation $y_{l1}$ but not on the training dataset $\mathcal{T}$, while $u_l$ and $\{v_{lm}\}$ depend on the training dataset $\mathcal{T}$ but not on the future observations.
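The standardization above is an exact algebraic identity: for any training dataset and future observation, $(y_{l1} - \hat{\mu}_l)^T \hat{\Sigma}_l^{-1} (y_{l1} - \hat{\mu}_l)$ equals $(w_l - u_l)^T \big( \frac{1}{n_l - 1} \sum_m v_{lm} v_{lm}^T \big)^{-1} (w_l - u_l)$, because $\frac{1}{n_l - 1} \sum_m v_{lm} v_{lm}^T = \Sigma_l^{-1/2} \hat{\Sigma}_l \Sigma_l^{-1/2}$. The following small R check is our own, with arbitrary illustrative values of $\mu_l$ and $\Sigma_l$; it verifies the identity numerically for a single simulated draw (any matrix B with B Sigma B' = I works in place of the symmetric square root).

set.seed(1)
p <- 2; n <- 50
mu    <- c(1, 2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
B     <- solve(t(chol(Sigma)))           # a square root with B %*% Sigma %*% t(B) = I_p

X       <- MASS::mvrnorm(n, mu, Sigma)   # "training" data for one class
y       <- MASS::mvrnorm(1, mu, Sigma)   # one future observation from the same class
mu_hat  <- colMeans(X)
Sig_hat <- cov(X)

q_orig <- mahalanobis(y, mu_hat, Sig_hat)                 # original-scale statistic
q_std  <- mahalanobis(drop(B %*% (y - mu_hat)),           # w - u = B (y - mu_hat)
                      rep(0, p), B %*% Sig_hat %*% t(B))  # (1/(n-1)) sum v v' = B Sig_hat B'
all.equal(as.numeric(q_orig), as.numeric(q_std))          # TRUE up to rounding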
Combining the assumption in Equation (7) with Equations (A1)–(A3) gives
\liminf_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} I\{ c_j \in C_{\mathcal{T}}(y_j) \}
= \sum_{l=1}^{k} r_l \, P_{w_l \mid u_l, \{v_{lm}\}} \left\{ (w_l - u_l)^T \left( \frac{1}{n_l - 1} \sum_{m=1}^{n_l - 1} v_{lm} v_{lm}^T \right)^{-1} (w_l - u_l) \le \lambda \right\},
from which the equivalence of Equations (6) and (8) follows immediately.

References

  1. Flach, P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data; Cambridge University Press: Cambridge, UK, 2012.
  2. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2017.
  3. Piegorsch, W.W. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery; Wiley: Hoboken, NJ, USA, 2015.
  4. Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 4th ed.; Academic Press: Cambridge, MA, USA, 2009.
  5. Webb, A.R.; Copsey, K.D. Statistical Pattern Recognition, 3rd ed.; Wiley: Hoboken, NJ, USA, 2011.
  6. Liu, W.; Bretz, F.; Srimaneekarn, N.; Peng, J.; Hayter, A.J. Confidence sets for statistical classification. Stats 2019, 2, 332–346.
  7. Serfling, R. Approximation Theorems of Mathematical Statistics; Wiley: Hoboken, NJ, USA, 1980.
  8. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
  9. Lieberman, G.J.; Miller, R.G., Jr. Simultaneous tolerance intervals in regression. Biometrika 1963, 50, 155–168.
  10. Lieberman, G.J.; Miller, R.G., Jr.; Hamilton, M.A. Simultaneous discrimination intervals in regression. Biometrika 1967, 54, 133–145; Correction in Biometrika 58, 687.
  11. Mee, R.W.; Eberhardt, K.R.; Reeve, C.P. Calibration and simultaneous tolerance intervals for regression. Technometrics 1991, 33, 211–219.
  12. Han, Y.; Liu, W.; Bretz, F.; Wan, F.; Yang, P. Statistical calibration and exact one-sided simultaneous tolerance intervals for polynomial regression. J. Stat. Plan. Inference 2016, 168, 90–96.
  13. Liu, W.; Han, Y.; Bretz, F.; Wan, F.; Yang, P. Counting by weighing: Know your numbers with confidence. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2016, 65, 641–648.
  14. Peng, J.; Liu, W.; Bretz, F.; Hayter, A.J. Counting by weighing: Two-sided confidence intervals. J. Appl. Stat. 2019, 46, 262–271.
  15. Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003.
  16. Chow, Y.S.; Teicher, H. Probability Theory: Independence, Interchangeability, Martingales; Springer: New York, NY, USA, 1978.
Figure 1. The exact (solid) and conservative (dotted) acceptance sets for the three classes.
