Abstract
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. Liu et al. (2019) proposed a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. However, the classifier uses a conservative critical constant. In this paper, we show how to determine the exact critical constant in applications where prior knowledge about the proportions of the future objects from each class is available. As the exact critical constant is smaller than the conservative critical constant given by Liu et al. (2019), the classifier using the exact critical constant is better than the classifier by Liu et al. (2019) as expected. An example is provided to illustrate the method.
1. Introduction
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. For overviews, the reader is referred to the books by [1,2,3,4,5]. In a recent paper, Liu et al. (2019) [6] proposed a new classifier based on confidence sets. It constructs a confidence set for the unknown parameter c, the true class of each future object, and classifies the object as belonging to the set of classes given by the confidence set. Hence, this approach classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects, with a pre-specified confidence about the randomness in the training data based on which the classifier is constructed.
However, the classifier of Liu et al. (2019) uses a conservative critical constant and so the resultant confidence sets may be larger than necessary. The purpose of this paper is to determine the exact critical constant and therefore to improve the classifier of Liu et al. (2019) in situations where one has prior knowledge about the proportions of the (infinite) future objects belonging to the k possible classes.
The layout of the paper is as follows. Section 2 gives a very brief review of the classifier of Liu et al. (2019), and then considers the determination of the exact critical constant under the additional knowledge/assumption given above. An illustrative example is given in Section 3 to demonstrate the advantage of the improved classifier proposed in this paper when the additional assumption holds. Section 4 contains conclusions and discussions. Finally, some mathematical details are provided in the Appendix A. As the same setting and notation as in the work by Liu et al. (2019) are used, it is recommended to read this paper in conjunction with the one by Liu et al. (2019).
2. Methodology
2.1. The Classifier of Liu et al. (2019) and Its Exact Critical Constant
Let the $p$-dimensional data vector $Y_l$ denote the feature measurement on an object from the $l$th class, which has the multivariate normal distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$; here, $k$ denotes the total number of classes, which is a known number. The available training dataset is given by $T = \{ X_{l,j} : j = 1, \ldots, n_l,\ l = 1, \ldots, k \}$, where $X_{l,1}, \ldots, X_{l,n_l}$ are i.i.d. observations from the $l$th class with distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The classification problem is to make inference about $c$, the true class of a future object, based on the feature measurement $Y$ observed on the object, which is only known to belong to one of the $k$ classes and so follows one of the $k$ multivariate normal distributions. In statistical terminology, $c$ is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \ldots, k\}$. We emphasize that $c$ is treated as non-random in both the work of Liu et al. (2019) and here.
A classifier that classifies an object with measurement $Y$ into one single class in $C$ can be regarded as a point estimator of $c$. The classifier of Liu et al. (2019) provides a set $C(Y) \subseteq C$ as plausible values of $c$. Depending on $Y$ and the training dataset $T$, $C(Y)$ may contain only a single value, in which case $Y$ is classified into the single class given by $C(Y)$. When $C(Y)$ contains more than one value in $C$, $Y$ is classified as possibly belonging to the several classes given by $C(Y)$. Hence, in statistical terms, the classifier uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed proportion of confidence sets that contain the true classes.
Specifically, the set $C(Y)$ was constructed by Liu et al. (2019) as

$$C(Y) = \{\, l \in C : (Y - \bar{X}_l)^{\top} S_l^{-1} (Y - \bar{X}_l) \le \lambda \,\}, \quad (1)$$

where $\bar{X}_l$ and $S_l$, $l = 1, \ldots, k$, are, respectively, the usual estimators of the unknown $\mu_l$ and $\Sigma_l$ based on the training dataset $T$, and $\lambda$ is a suitably chosen critical constant whose determination is considered next. The intuition behind the definition of $C(Y)$ in Equation (1) is that a future object $Y$ is likely to be from class $l$ if and only if $(Y - \bar{X}_l)^{\top} S_l^{-1} (Y - \bar{X}_l) \le \lambda$.
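The construction in Equation (1) is a Mahalanobis-type quadratic-form check against each class. As an illustration only (not the authors' R program), the following Python sketch performs this membership check; the function name, the made-up class means and the identity inverse-covariance estimates are all our own assumptions:

```python
import numpy as np

def confidence_set(y, xbars, S_inv, lam):
    """Return the classes l whose acceptance region contains y, i.e. those
    with (y - xbar_l)' S_l^{-1} (y - xbar_l) <= lam, as in Equation (1)."""
    out = []
    for l, (xbar, Si) in enumerate(zip(xbars, S_inv), start=1):
        d = y - xbar
        if d @ Si @ d <= lam:
            out.append(l)
    return out

# Toy illustration with two made-up classes and identity covariances
xbars = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
S_inv = [np.eye(2), np.eye(2)]
print(confidence_set(np.array([0.2, -0.1]), xbars, S_inv, lam=6.0))  # [1]
print(confidence_set(np.array([2.0, 2.0]), xbars, S_inv, lam=9.0))   # [1, 2]
```

A point near one class mean is classified into that single class, while a point in the overlap of two acceptance regions is classified into both classes.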
Note that the proportion of the future confidence sets $C(Y_1), C(Y_2), \ldots$ that include the true classes $c_1, c_2, \ldots$ of $Y_1, Y_2, \ldots$ is given by $\liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \}$. Thus, it is desirable that

$$\liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma, \quad (2)$$

where $1 - \gamma$ is a pre-specified large (close to 1) proportion, e.g., 0.95. While the constraint in Equation (2) is difficult to deal with directly, Liu et al. (2019) showed that a sufficient condition for guaranteeing Equation (2) is

$$\min_{1 \le l \le k} P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \ge 1 - \gamma, \quad (3)$$

where $P_{Y_l}\{\cdot \mid T\}$ denotes the conditional probability (the conditional expectation of the indicator) with respect to the random variable $Y_l \sim N_p(\mu_l, \Sigma_l)$ conditioning on the training dataset $T$ (or, equivalently, the $\bar{X}_l$'s and $S_l$'s).
Since the value of the expression on the left hand side of the inequality in Equation (3) (and in Equation (2) as well) depends on $T$ and is random, the inequality in Equation (3) cannot be guaranteed for each observed $T$. We therefore guarantee Equation (3) with a large (close to 1) probability $1 - \alpha$ with respect to the randomness in $T$:

$$P_T\Big\{ \min_{1 \le l \le k} P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (4)$$

which in turn guarantees that

$$P_T\Big\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma \Big\} \ge 1 - \alpha. \quad (5)$$
Computer code in R was provided by Liu et al. (2019) to compute the $\lambda$ that solves Equation (4), which allows the confidence set $C(Y)$ in Equation (1) to be constructed for each future object.
The interpretation of Equations (5) and (6) below is that, based on one observed training dataset $T$, one constructs the confidence sets $C(Y_i)$ for the true classes $c_i$ of all future $Y_i$ ($i = 1, 2, \ldots$) and claims that at least a $1 - \gamma$ proportion of these confidence sets do contain the true $c_i$'s. Then, we are $100(1 - \alpha)\%$ confident, with respect to the randomness in the training dataset $T$, that the claim is correct.
A natural question is how to find the exact critical constant $\lambda_e$ that solves the equation

$$P_T\Big\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (6)$$

which is an improvement on the conservative $\lambda_c$ that solves Equation (4) as given by Liu et al. (2019). Next, we show how to find the exact critical constant $\lambda_e$ under an additional assumption which is satisfied in some applications.
Assume that, among the $N$ future objects that need to be classified, $N_l$ objects are actually from the $l$th class with the distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The additional assumption we make is that

$$\lim_{N \to \infty} \frac{N_l}{N} = p_l, \quad l = 1, \ldots, k, \quad (7)$$

where the $p_l$'s are assumed to be known constants in the interval $[0, 1]$. Intuitively, this assumption means that we know the proportions of the future objects that belong to each of the $k$ classes, even though we do not know the true class of each individual future object.
The assumption in Equation (7) is reasonable in some applications. For example, when screening for a particular disease among a specific population for preventive purposes, there are $k = 2$ classes: having the disease ($l = 1$) or not having the disease ($l = 2$). If we know the prevalence of the disease, $d$, in the overall population, then $p_1 = d$ and $p_2 = 1 - d$, even though we do not know whether an individual subject has the disease or not.
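As a tiny numerical illustration of this two-class case (the prevalence value below is hypothetical):

```python
# Hypothetical prevalence d of the disease in the population
d = 0.02

# Class proportions of Equation (7): class 1 = diseased, class 2 = healthy
p1, p2 = d, 1 - d

print(p1, p2)  # 0.02 0.98
```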
It is shown in Appendix A that, under the assumption in Equation (7), Equation (6) is equivalent to

$$P\Big\{ \sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda_e \,\big|\, \bar{Z}_l, V_l \big\} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (8)$$

where

$$Z_l \sim N_p(\mathbf{0}, I_p), \quad \bar{Z}_l \sim N_p(\mathbf{0}, I_p / n_l), \quad (n_l - 1) V_l \sim \text{Wishart}_p(n_l - 1, I_p), \quad l = 1, \ldots, k, \quad (9)$$

and all the $Z_l$'s, $\bar{Z}_l$'s and $V_l$'s are independent, $P\{\cdot \mid \bar{Z}_l, V_l\}$ denotes the conditional probability about $Z_l$ conditioning on $(\bar{Z}_l, V_l)$, and the outer $P$ denotes the probability about the $(\bar{Z}_l, V_l)$'s.
2.2. Algorithm for Computing the Exact Critical Constant $\lambda_e$
We now consider how to compute the critical constant $\lambda_e$ that solves Equation (8). Similar to Liu et al. (2019), this is accomplished by simulation in the following way. From the distributions given in Equation (9), in the $s$th repeat of the simulation, $s = 1, \ldots, S$, generate independent

$$\bar{Z}_l^{s} \sim N_p(\mathbf{0}, I_p / n_l) \quad \text{and} \quad (n_l - 1) V_l^{s} \sim \text{Wishart}_p(n_l - 1, I_p), \quad l = 1, \ldots, k,$$

and find the $\lambda^{s}$ so that

$$\sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l^{s})^{\top} (V_l^{s})^{-1} (Z_l - \bar{Z}_l^{s}) \le \lambda^{s} \,\big|\, \bar{Z}_l^{s}, V_l^{s} \big\} = 1 - \gamma. \quad (10)$$

Repeat this $S$ times to get $\lambda^{1}, \ldots, \lambda^{S}$ and order these as $\lambda^{(1)} \le \cdots \le \lambda^{(S)}$. It is well known (cf. [7]) that $\lambda^{(\lceil (1-\alpha) S \rceil)}$ converges to the required critical constant with probability one as $S \to \infty$. Hence, $\lambda^{(\lceil (1-\alpha) S \rceil)}$ is used as the required critical constant $\lambda_e$ for a large $S$ value, e.g., $S = 10{,}000$.
To find the $\lambda^{s}$ in Equation (10) for each $s$, we use simulation in the following way. Generate $Q$ independent random vectors $Z^{1}, \ldots, Z^{Q}$ from $N_p(\mathbf{0}, I_p)$, where $Q$ is the number of simulations for finding $\lambda^{s}$. For each given value of $\lambda^{s}$, the expression on the left side of Equation (10) can be computed by approximating each of the $k$ probabilities involved by the corresponding proportions out of the $Q$ simulations. It is also clear that this expression is monotone increasing in $\lambda^{s}$. Hence, the $\lambda^{s}$ that solves Equation (10) can be found by using a search algorithm; for example, the bisection method is used in our R code. To approximate the probabilities by the proportions reasonably accurately, a large $Q$ value, e.g., $Q = 10{,}000$, should be used.
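The two simulation layers just described can be sketched as follows in Python (the authors' implementation is in R; this re-implementation is our own, uses small S and Q for speed rather than the recommended 10,000, and all function and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def exact_lambda(n, p, props, gamma=0.05, alpha=0.05, S=300, Q=1000):
    """Sketch of the simulation algorithm for the exact critical constant.

    n: training sample sizes n_1, ..., n_k; props: the known proportions
    p_1, ..., p_k of Equation (7). S and Q are deliberately small here."""
    k = len(n)
    Z = rng.standard_normal((Q, p))              # inner sample Z ~ N_p(0, I)
    lams = np.empty(S)
    for s in range(S):
        # pivotal quantities of Equation (9), one set per class
        qf = []
        for l in range(k):
            zbar = rng.standard_normal(p) / np.sqrt(n[l])  # Zbar_l ~ N_p(0, I/n_l)
            G = rng.standard_normal((n[l] - 1, p))
            V = G.T @ G / (n[l] - 1)             # (n_l - 1) V_l ~ Wishart_p(n_l - 1, I)
            D = Z - zbar
            qf.append(np.einsum('ij,ij->i', D @ np.linalg.inv(V), D))

        def lhs(lam):                            # left side of Equation (10)
            return sum(props[l] * np.mean(qf[l] <= lam) for l in range(k))

        lo, hi = 0.0, 1.0                        # bisection: lhs is increasing in lam
        while lhs(hi) < 1 - gamma:
            hi *= 2.0
        for _ in range(40):
            mid = (lo + hi) / 2.0
            if lhs(mid) < 1 - gamma:
                lo = mid
            else:
                hi = mid
        lams[s] = hi
    lams.sort()
    return lams[int(np.ceil((1 - alpha) * S)) - 1]  # empirical (1 - alpha) quantile

lam_e = exact_lambda(n=[50, 50, 50], p=2, props=[1 / 3, 1 / 3, 1 / 3])
```

Each outer repeat draws the training-data pivots, each inner Monte Carlo approximates the $k$ conditional probabilities by proportions, and the final answer is the order statistic $\lambda^{(\lceil (1-\alpha) S \rceil)}$.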
It is noteworthy from Equations (8) and (9) that $\lambda_e$ depends only on $k$, $p$, $n_1, \ldots, n_k$, $p_1, \ldots, p_k$, $\gamma$ and $\alpha$ (and the numbers of simulations $S$ and $Q$, which determine the numerical accuracy of $\lambda_e$ due to simulation randomness). It is also worth emphasizing that only one $\lambda_e$ needs to be computed; based on the observed training dataset $T$, it is then used for constructing the confidence sets and classifying accordingly all future objects.
It is expected that larger values of $S$ and $Q$ will produce a more accurate value of $\lambda_e$; one can use the method discussed by Liu et al. (2019) to assess how the accuracy of $\lambda_e$ depends on the values of $S$ and $Q$. Similar to the work by Liu et al. (2019), it is recommended to set $S = 10{,}000$ and $Q = 10{,}000$ for reasonable computation time and accuracy of $\lambda_e$ due to simulation randomness.
3. An Illustrative Example
As in the work of Liu et al. (2019), the famous iris dataset introduced by Fisher (1936) [8] is used in this section to illustrate the method proposed in this paper. The dataset contains $k = 3$ classes representing the three species/classes of Iris flowers (1 = setosa; 2 = versicolor; and 3 = virginica), and has 50 observations from each class in $p = 4$ dimensions. Each observation gives the measurements (in centimeters) of the four variables: sepal length and width, and petal length and width.
We focus on the case where only the first two measurements, sepal length and width, are used for classification ($p = 2$), in order to illustrate the method easily, since the acceptance sets are then two-dimensional and thus can be easily plotted. Based on the fifty observations on these two measurements from each of the three classes, the estimates $\bar{X}_l$ and $S_l$, $l = 1, 2, 3$, were given by Liu et al. (2019).
For the given $\alpha$ and $\gamma$, the conservative critical constant $\lambda_c$ that solves Equation (4) was computed by Liu et al. (2019) using $S = 10{,}000$ and $Q = 10{,}000$. The corresponding acceptance sets

$$A_l = \{\, y : (y - \bar{X}_l)^{\top} S_l^{-1} (y - \bar{X}_l) \le \lambda_c \,\}, \quad l = 1, 2, 3,$$

based on which the confidence set $C(Y)$ in Equation (1) can be constructed directly (cf. [6]), are plotted in Figure 1 by the dotted ellipsoidal regions centered at the $\bar{X}_l$'s, marked by "+".
Figure 1.
The exact (solid) and conservative (dotted) acceptance sets for the three classes.
Now, assume that we have knowledge of the proportions of the three species among all the Iris flowers, and that the Iris flowers that need to be classified reflect this composition. For the same $\alpha$, $\gamma$, $S = 10{,}000$ and $Q = 10{,}000$, and with, for example, a given set of proportions $(p_1, p_2, p_3)$, the exact critical constant $\lambda_e$ that solves Equation (6) is computed by our R program. As expected, $\lambda_e$ is smaller than $\lambda_c$ and, as a result, the corresponding confidence set $C(Y)$ in Equation (1) with $\lambda = \lambda_e$ and its acceptance sets $A_l$, $l = 1, 2, 3$, are also smaller than those given by Liu et al. (2019).
The exact acceptance sets $A_l$ are plotted in Figure 1 by the solid ellipsoidal regions. For example, if a future object has the measurement $Y$ marked by the solid dot in Figure 1, then the conservative confidence set of Liu et al. (2019) classifies the object as from Classes 2 and 3, since this $Y$ belongs to both of the conservative acceptance sets $A_2$ and $A_3$. However, the new exact confidence set of this paper classifies the object as from Class 2 only, since this $Y$ belongs to the exact $A_2$ but not $A_1$ or $A_3$. This demonstrates the advantage of the new confidence set using $\lambda_e$ in this paper over the conservative confidence set using $\lambda_c$ by Liu et al. (2019). We have also computed the value of $\lambda_e$ for several other given sets of proportions $(p_1, p_2, p_3)$. The conservative $\lambda_c$ is considerably larger than these values, by between 14% and 19%.
One can download from http://www.personal.soton.ac.uk/wl/Classification/ the R computer program ExactConfidenceSetClassifier.R that implements this simulation method of computing the critical constant $\lambda_e$. The computation of one $\lambda_e$ using $(S, Q) = (10{,}000, 10{,}000)$ takes about 13 h on an ordinary Windows PC (Core(TM)2 Duo CPU P8400 @ 2.26 GHz).
4. Conclusions
The probability statement in Equation (5) means that the confidence sets of Liu et al. (2019) have the nice interpretation that, with confidence level $1 - \alpha$ about the randomness in the training dataset $T$, at least a $1 - \gamma$ proportion of the confidence sets $C(Y_i)$, $i = 1, 2, \ldots$, contain the true classes $c_i$, $i = 1, 2, \ldots$, of the future objects $Y_i$, $i = 1, 2, \ldots$. However, the confidence set given by Liu et al. (2019) is conservative in that the $\lambda$ in the confidence set in Equation (1) is computed to solve Equation (4), which only implies the constraint in Equation (5). This paper considers how to compute the $\lambda$ in the confidence set in Equation (1) so that the probability in Equation (5) is equal to $1 - \alpha$, i.e., to solve Equation (6). The confidence sets using the $\lambda_e$ that solves Equation (6) have a confidence level equal to $1 - \alpha$ and so are exact. We show that this can be accomplished under the extra assumption given in Equation (7), which may be sensible in some applications.
As the $\lambda_e$ that solves Equation (6) is smaller than the $\lambda_c$ that solves Equation (4) used by Liu et al. (2019), the new confidence sets are smaller and so better than the confidence sets given by Liu et al. (2019).
One wonders whether there are other sensible assumptions that allow the $\lambda_e$ to be solved from Equation (6). This warrants further research.
If $C(Y)$ for a future object $Y$ is empty then, since $Y$ must be from one of the $k$ classes, $C(Y)$ can be augmented to include the class that has the largest posterior probability under the naive Bayesian classifier, as in the work by Liu et al. (2019). The probability statement in Equation (5) clearly holds under this augmentation, since $C(Y)$ is enlarged only when it is empty.
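A sketch of this augmentation step, assuming plug-in normal densities weighted by the known proportions serve as the posterior weights (the exact Bayes rule used by Liu et al. (2019) may differ, and all names and numbers here are our own):

```python
import numpy as np

def augment_if_empty(y, cset, xbars, S_list, props):
    """If the confidence set is empty, return the single class with the
    largest plug-in posterior weight p_l * N(y; xbar_l, S_l); otherwise
    return the confidence set unchanged."""
    if cset:
        return cset
    def log_weight(l):
        d = y - xbars[l]
        sign, logdet = np.linalg.slogdet(S_list[l])
        return np.log(props[l]) - 0.5 * logdet - 0.5 * (d @ np.linalg.solve(S_list[l], d))
    return [max(range(len(xbars)), key=log_weight) + 1]

# Toy usage: an empty confidence set for a point near class 1's mean
xbars = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
S_list = [np.eye(2), np.eye(2)]
print(augment_if_empty(np.array([0.5, 0.0]), [], xbars, S_list, [0.5, 0.5]))   # [1]
print(augment_if_empty(np.array([0.5, 0.0]), [2], xbars, S_list, [0.5, 0.5]))  # [2]
```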
There are applications in which information about the proportions $p_l$ is known only with uncertainty. For example, the training dataset may be a representative sample from the population, so that the proportion of each class can be estimated from it, or the proportions may have been estimated from a previous independent auxiliary dataset. If one replaces the $p_l$'s in Equation (8) by these estimates, then the $\lambda_e$ solved from Equation (8) will depend on these estimates and so be random. As a result, the probability statement in Equation (5) is no longer valid. How to deal with these applications warrants further research.
Finally, the classifier of Liu et al. (2019) is developed from the idea of Lieberman et al. [9,10]. The same idea was also used by, for example, Mee et al. (1991) [11], Han et al. (2016) [12], Liu et al. (2016) [13] and Peng et al. (2019) [14], who all used conservative critical constants as did Liu et al. (2019). The idea of this paper can be applied to all these works to compute exact critical constants under suitable extra assumptions.
Author Contributions
Methodology, F.B., A.J.H., W.L.; software, W.L.
Acknowledgments
We would like to thank the referees for critical and constructive comments on the earlier version of the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Mathematical Details
In this appendix, we show the equivalence of Equations (6) and (8) under the assumption in Equation (7). Note first the well-known fact (cf. [15]) that $\bar{X}_l \sim N_p(\mu_l, \Sigma_l / n_l)$ and $(n_l - 1) S_l \sim \text{Wishart}_p(n_l - 1, \Sigma_l)$ are independent, with the future feature measurements being i.i.d. random vectors independent of the training dataset $T$.
Among the $N$ future objects that need to be classified, let $N_l$ be the number of objects actually from the $l$th class, with their feature measurements denoted as $Y_{l,1}, \ldots, Y_{l,N_l}$, $l = 1, \ldots, k$. Clearly, we have $N = N_1 + \cdots + N_k$ and

$$\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} = \sum_{l=1}^{k} \frac{N_l}{N} \cdot \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbf{1}\{\, l \in C(Y_{l,i}) \,\}. \quad (A1)$$
We have from the classical strong law of large numbers (cf. [16]) that

$$\lim_{N_l \to \infty} \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbf{1}\{\, l \in C(Y_{l,i}) \,\} = P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \quad \text{almost surely}, \quad (A2)$$

in which the conditional probability is used since all the confidence sets $C(Y_{l,i})$ ($i = 1, \ldots, N_l$) use the same training dataset $T$. By noting that $Y_{l,1}, \ldots, Y_{l,N_l}$ are from the $l$th class and thus have the same distribution $N_p(\mu_l, \Sigma_l)$, and writing $Y_l = \mu_l + \Sigma_l^{1/2} Z_l$, $\bar{X}_l = \mu_l + \Sigma_l^{1/2} \bar{Z}_l$ and $S_l = \Sigma_l^{1/2} V_l \Sigma_l^{1/2}$, so that the factor $\Sigma_l^{1/2}$ cancels in the quadratic form, we have from the definition of $C(Y)$ in Equation (1) that

$$P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} = P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda \,\big|\, \bar{Z}_l, V_l \big\}, \quad (A3)$$

where

$$Z_l \sim N_p(\mathbf{0}, I_p), \quad \bar{Z}_l \sim N_p(\mathbf{0}, I_p / n_l), \quad (n_l - 1) V_l \sim \text{Wishart}_p(n_l - 1, I_p),$$

with all the $Z_l$'s, $\bar{Z}_l$'s and $V_l$'s being independent. Note that $Z_l$ depends on the future observation but not the training dataset $T$, while $\bar{Z}_l$ and $V_l$ depend on the training dataset $T$ but not the future observations.
Combining the assumption in Equation (7) with Equations (A1)–(A3) gives

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} = \sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda \,\big|\, \bar{Z}_l, V_l \big\} \quad \text{almost surely},$$

from which the equivalence of Equations (6) and (8) follows immediately.
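The invariance used in Equation (A3), namely that the conditional coverage probability does not depend on the unknown $(\mu_l, \Sigma_l)$, can be checked numerically. The following Python sketch is our own (all numeric values are arbitrary) and compares the Monte Carlo coverage under a generic $(\mu, \Sigma)$ with the standardized case $\mu = \mathbf{0}$, $\Sigma = I$ of Equation (9):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, lam, reps = 2, 50, 8.0, 10000

def coverage(mu, Sigma):
    """Monte Carlo estimate of P{(Y - Xbar)' S^{-1} (Y - Xbar) <= lam},
    averaging over both the training sample and the future observation Y."""
    A = np.linalg.cholesky(Sigma)
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p)) @ A.T + mu  # training sample ~ N_p(mu, Sigma)
        Y = A @ rng.standard_normal(p) + mu         # future observation ~ N_p(mu, Sigma)
        d = Y - X.mean(axis=0)
        S = np.cov(X, rowvar=False)                 # usual estimator of Sigma
        hits += d @ np.linalg.solve(S, d) <= lam
    return hits / reps

c_general = coverage(np.array([3.0, -1.0]), np.array([[2.0, 0.6], [0.6, 1.0]]))
c_standard = coverage(np.zeros(p), np.eye(p))       # the pivotal case of Equation (9)
```

The two estimates agree to within Monte Carlo error, illustrating that $\lambda_e$ can be computed without knowing the $\mu_l$'s and $\Sigma_l$'s.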
References
- Flach, P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data; Cambridge University Press: Cambridge, UK, 2012.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2017.
- Piegorsch, W.W. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery; Wiley: Hoboken, NJ, USA, 2015.
- Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 4th ed.; Academic Press: Cambridge, MA, USA, 2009.
- Webb, A.R.; Copsey, K.D. Statistical Pattern Recognition, 3rd ed.; Wiley: Hoboken, NJ, USA, 2011.
- Liu, W.; Bretz, F.; Srimaneekarn, N.; Peng, J.; Hayter, A.J. Confidence sets for statistical classification. Stats 2019, 2, 332–346.
- Serfling, R. Approximation Theorems of Mathematical Statistics; Wiley: Hoboken, NJ, USA, 1980.
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
- Lieberman, G.J.; Miller, R.G., Jr. Simultaneous tolerance intervals in regression. Biometrika 1963, 50, 155–168.
- Lieberman, G.J.; Miller, R.G., Jr.; Hamilton, M.A. Simultaneous discrimination intervals in regression. Biometrika 1967, 54, 133–145; Correction in 1971, 58, 687.
- Mee, R.W.; Eberhardt, K.R.; Reeve, C.P. Calibration and simultaneous tolerance intervals for regression. Technometrics 1991, 33, 211–219.
- Han, Y.; Liu, W.; Bretz, F.; Wan, F.; Yang, P. Statistical calibration and exact one-sided simultaneous tolerance intervals for polynomial regression. J. Stat. Plan. Inference 2016, 168, 90–96.
- Liu, W.; Han, Y.; Bretz, F.; Wan, F.; Yang, P. Counting by weighing: Know your numbers with confidence. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2016, 65, 641–648.
- Peng, J.; Liu, W.; Bretz, F.; Hayter, A.J. Counting by weighing: Two-sided confidence intervals. J. Appl. Stat. 2019, 46, 262–271.
- Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003.
- Chow, Y.S.; Teicher, H. Probability Theory: Independence, Interchangeability, Martingales; Springer: New York, NY, USA, 1978.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
