1. Introduction
Classification has applications in a wide range of fields, including medicine, engineering, computer science and the social sciences; see, e.g., the recent books [1,2,3,4]. Classical examples include medical diagnosis, automatic character recognition, data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis) and artificial intelligence (such as the development of machines with brain-like performance). As many important developments in this area are not confined to the statistics literature, various other names, such as supervised learning, pattern recognition and machine learning, have been used. In recent years, there have been many exciting new developments in both methodology and applications, taking advantage of the increased computational power readily available nowadays. Broadly speaking, classification methods can be divided into probabilistic methods (including Bayesian classifiers), regression methods (including logistic regression and regression trees), geometric methods (including support vector machines), and ensemble methods (combining classifiers for improved robustness).
A classifier is a decision rule built from a training data set that classifies each future object as belonging to one or several of the k known classes, where k is a pre-specified number. The drawback of a classifier that classifies each future object into only one of the k classes is that, when the object is close to the classification boundaries of several classes, M say, the chance of misclassification is close to $(M-1)/M$, which may be close to one when M is large. A sensible approach in this situation is to acknowledge that such an object has similar chances of belonging to the M classes and hence to avoid classifying it into only one of the M classes. In medical diagnosis, for example, if there is not enough evidence to classify a patient as having a disease or not, then it is wise not to give a diagnosis that is quite likely to be wrong.
Various procedures have been proposed in the literature to deal with this difficulty. One type of procedure allows a rejection option, that is, if a future object falls into a 'rejection' region, then no classification is made for the object. Such a procedure aims to construct a suitable rejection region to minimize a pre-specified risk; see, e.g., [5,6,7,8,9] and the references therein. Non-deterministic classifiers, which allow a future object to be classified possibly into several classes, are proposed in [10]. Again, such a classifier is constructed to minimize a pre-specified risk.
For the binary classification problem (i.e., $k = 2$), ref. [11] proposes to find two 'tolerance' regions (corresponding to the two classes) in the feature/predictor space, each with a specified coverage level for its class, that minimize the probability that an object falls into the intersection of the two tolerance regions, since an object in this intersection will not be classified. This approach is akin to the decision-theoretic approaches mentioned in the last paragraph but uses this specific probability as the risk to minimize. As with other decision-theoretic approaches, it is not constructed to guarantee the proportion of correct classification and is thus different from the approach proposed in this paper. Further development of this approach is considered in [12].
The conformal prediction approach of [13,14] also classifies a future object into possibly several classes that contain the true class with a pre-specified probability. However, this approach is designed for the 'online' setting in which the true classes of all the observed objects are revealed, and hence known, before the classification of the next object is made. This online setting is different from the usual setting of classification considered in this paper, in which a classifier is built from the available training data set and then used to classify a large number of future objects without knowing their true classes.
For the binary classification problem, ref. [15] proposes a classifier that allows no classification of an object. By controlling the size of the non-classification region (for which classification error does not occur) via a tuning constant, a 'generalized error' of the classifier is controlled at a pre-specified level with a specified confidence about the randomness in the training data set. The construction of this classifier is related to the tolerance sets going back to [16]. Note, however, that the algorithm in [15] may result in a different classifier if a different observation in the training data set is used as the 'base' instance in the algorithm, which is quite odd from a statistical point of view. In addition, the 'generalized error' is different from the long-run frequency of correct classification, which the procedure proposed in this paper aims to control.
The purpose of this paper is to propose a classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. Specifically, classification of a future object is treated as a standard problem of statistical inference about the unknown parameter c, the true class of the object, and the confidence set approach for c is adopted. In order to consider the probability of correct classification, it is necessary to assume certain probability distributions for the feature measurements from the k classes. In this paper, the feature measurements of the k classes are assumed to follow multivariate normal distributions, which are widely used either directly or after some transformation (see [17,18]).
The layout of the paper is as follows. Section 2 contains some preliminaries, including the idea of [19,20] from which the new approach proposed in this paper is developed. The simple situation where the means and covariance matrices of the k multivariate normal distributions underlying the k classes are assumed to be known is considered in Section 3. The more realistic situation where both the means and the covariance matrices are unknown parameters is studied in Section 4. Section 5 provides an illustrative example. A simulation study is given in Section 6 to highlight the major advantage of the new classifier proposed in this paper. Section 7 provides the conclusions. Finally, some mathematical details are provided in Appendix A.
2. Preliminaries
Let the $p$-dimensional data vector $\mathbf{y}$ denote the feature measurement on an object from the $i$th class, which has the multivariate normal distribution $N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \dots, k$. The training data set $T$ consists of i.i.d. observations from the $i$th class with distribution $N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \dots, k$. The classification problem is to make inference about c, the true class of a future object, based on the feature measurement $\mathbf{y}$ observed on the object, which is only known to belong to one of the k classes and so follows one of the k multivariate normal distributions. In statistical terminology, c is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \dots, k\}$. We emphasize that c is treated as non-random in our frequentist approach.
A classifier that classifies an object with measurement $\mathbf{y}$ into one single class in C can be regarded as a point estimator of c. The classifier proposed in this paper provides a set $C(\mathbf{y})$ of plausible values of c. Depending on $\mathbf{y}$ and the training data set $T$, $C(\mathbf{y})$ may contain only a single value, in which case $\mathbf{y}$ is classified into the single class given by $C(\mathbf{y})$. When $C(\mathbf{y})$ contains more than one value in C, $\mathbf{y}$ is classified as possibly belonging to the several classes given by $C(\mathbf{y})$. Hence, in statistical terms, the classifier proposed in this paper uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed proportion of confidence sets that contain the true classes.
The confidence set for c is constructed below by inverting a family of acceptance sets for testing $c = l$ for each $l \in C$. This method of constructing a confidence set was given by [21] and has been used and generalized to construct numerous intriguing confidence sets; see, e.g., [22,23,24,25,26,27,28,29].
Now, the key idea of [19,20], which is crucial for understanding our proposed approach to classification, is presented very briefly. Assume that a response y and a predictor x are related by a standard linear regression model, and that a training data set $T$ on (x, y) is available for estimating the regression coefficients and the error variance. Refs. [19,20] consider how to construct confidence sets for the unknown (non-random) values of the predictor x corresponding to a large number of future observed values of the response y. As the same training data set $T$ is used in the construction of all these confidence sets, the randomness in the future y-values and the randomness in $T$ clearly play different roles and thus should be treated differently. The procedure proposed in [19,20] guarantees, with at least a pre-specified probability with respect to the randomness in $T$, that at least a pre-specified proportion of all the confidence sets, constructed from the same $T$, include the true x-values. This idea/approach has been studied by many researchers; see, e.g., [30,31,32,33,34,35,36] and the references therein. One fundamental result is that the confidence sets constructed from simultaneous tolerance intervals do satisfy the probability–proportion property specified above. In particular, ref. [36] points out, by constructing a counterexample, that the confidence sets constructed from pointwise tolerance intervals do not guarantee the probability–proportion property in general. A similar idea is also used in [37] to construct confidence sets for the numbers of coins in all future bags with known weights.
Since a classifier is built from the training data set $T$ and then used to classify a large number of future objects in terms of confidence sets for their true classes, the future feature measurements play similar roles to the future observed y-values, whilst the unknown true classes of the future objects play similar roles to the unknown true x-values, in the approach of [19,20] described in the last paragraph. Hence, it is natural to adopt the approach of [19,20] to construct confidence sets for the unknown true classes of future objects with the probability–proportion property, that is, the probability, with respect to the randomness in $T$, is at least the pre-specified level that at least the pre-specified proportion of all the confidence sets constructed from the same $T$ include the unknown true classes of all future objects.
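In symbols, and using illustrative notation of our own (the confidence level $1-\gamma$ and the proportion $\delta$ stand in for the paper's pre-specified constants, whose symbols are not shown here), the desired probability–proportion property can be written as
$$P_{T}\left\{ \liminf_{N\rightarrow\infty}\ \frac{1}{N}\sum_{j=1}^{N} \mathbf{1}\{c_j \in C_{T}(\mathbf{y}_j)\} \ \ge\ \delta \right\} \ \ge\ 1-\gamma,$$
where $c_j$ is the unknown true class of the $j$th future object, $\mathbf{y}_j$ is its feature measurement and $C_{T}(\mathbf{y}_j)$ is the confidence set built from the training data set $T$.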
3. Known Means and Covariance Matrices
In this section, the values of the means $\boldsymbol{\mu}_l$ and covariance matrices $\boldsymbol{\Sigma}_l$ are assumed to be known, which helps to motivate and understand the confidence sets constructed in Section 4 for the more realistic situation where these values are unknown. Since the $\boldsymbol{\mu}_l$ and $\boldsymbol{\Sigma}_l$ are known, no training data set $T$ is required to estimate them. Hence, the confidence sets in this section are denoted as $C(\mathbf{y})$, without the subscript $T$.
If $\mathbf{y}$ is from the $l$th class, then $\mathbf{y} \sim N_p(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ and so $(\mathbf{y}-\boldsymbol{\mu}_l)'\boldsymbol{\Sigma}_l^{-1}(\mathbf{y}-\boldsymbol{\mu}_l)$ has the chi-square distribution $\chi^2_p$ with p degrees of freedom. We construct a $1-\alpha$ level confidence set for the class c of the observed $\mathbf{y}$ by using the method of [21] of inverting a family of level $1-\alpha$ acceptance sets for testing $c = l$ for each number l in C. Specifically, the acceptance set for $c = l$ is given by
$$A(l) = \left\{ \mathbf{y} : (\mathbf{y}-\boldsymbol{\mu}_l)'\boldsymbol{\Sigma}_l^{-1}(\mathbf{y}-\boldsymbol{\mu}_l) \le \chi^2_{p,1-\alpha} \right\}, \qquad (1)$$
where $\chi^2_{p,1-\alpha}$ is the $1-\alpha$ quantile of the $\chi^2_p$ distribution. It follows directly from Neyman's method that the confidence set is given by
$$C(\mathbf{y}) = \{ l \in C : \mathbf{y} \in A(l) \}. \qquad (2)$$
It is straightforward to show, by using the Neyman–Pearson lemma, that the acceptance set $A(l)$ in Equation (1) is optimal in terms of having the smallest volume among all the level $1-\alpha$ acceptance sets for testing $c = l$.
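As a concrete illustration, the construction in Equations (1) and (2) takes only a few lines of R. This is a minimal sketch of ours, not the authors' program, and the parameter values in the example are hypothetical.

conf_set_known <- function(y, mus, Sigmas, alpha = 0.05) {
  # chi-square 1-alpha quantile: the critical value in Equation (1)
  crit <- qchisq(1 - alpha, df = length(y))
  # squared Mahalanobis distance of y from each class mean
  dists <- mapply(function(mu, Sig) mahalanobis(y, mu, Sig), mus, Sigmas)
  # Equation (2): C(y) = { l : y falls in the acceptance set A(l) }
  which(dists <= crit)
}

# Hypothetical example with k = 2 bivariate normal classes:
mus    <- list(c(0, 0), c(3, 3))
Sigmas <- list(diag(2), diag(2))
conf_set_known(c(0.5, 0.2), mus, Sigmas)   # returns 1: classified into class 1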
As for the usual confidence sets, it is desirable that, among the confidence sets $C(\mathbf{y}_j)$ for the corresponding unknown true classes $c_j$ of the infinitely many future $\mathbf{y}_j$ with distribution $N_p(\boldsymbol{\mu}_{c_j}, \boldsymbol{\Sigma}_{c_j})$ ($j = 1, 2, \dots$), at least a $1-\alpha$ proportion will contain the true $c_j$'s. That is, it is desirable that
$$\lim_{N\rightarrow\infty} \frac{1}{N}\sum_{j=1}^{N} \mathbf{1}\{c_j \in C(\mathbf{y}_j)\} \ \ge\ 1-\alpha, \qquad (3)$$
where $\mathbf{1}\{A\}$ denotes the indicator function of the set A and so $\frac{1}{N}\sum_{j=1}^{N} \mathbf{1}\{c_j \in C(\mathbf{y}_j)\}$ is the proportion among the N confidence sets $C(\mathbf{y}_1), \dots, C(\mathbf{y}_N)$ that contain the true classes $c_1, \dots, c_N$. It is shown in Appendix A that the property in Equation (3) holds with equality.
The interpretation of the property in Equation (3) is similar to that of a standard confidence set. The noteworthy difference is that the confidence sets $C(\mathbf{y}_j)$ are for possibly different parameters $c_j$ ($j = 1, 2, \dots$). In addition, note that, for each j, $C(\mathbf{y}_j)$ is a standard $1-\alpha$ level confidence set for $c_j$, with $\mathbf{y}_j$ being the only source of randomness.
Figure 1 gives an illustrative example with $k = 3$ and $p = 2$, using the means and covariance matrices given in Section 5 as the known parameter values. Specifically, the acceptance set $A(l)$ in Equation (1) is represented in Figure 1 by the ellipsoidal region centred at $\boldsymbol{\mu}_l$, marked by '+', $l = 1, 2, 3$. If $\mathbf{y} \in A(l)$, then l is an element of the confidence set $C(\mathbf{y})$ given in Equation (2). Hence, the following four situations can occur. (a) $\mathbf{y}$ falls into only one $A(l)$ and so $C(\mathbf{y})$ has a single class. For example, if $\mathbf{y}$ falls into $A(1)$ only, then $C(\mathbf{y}) = \{1\}$, i.e., $\mathbf{y}$ is classified as belonging to class 1. (b) $\mathbf{y}$ falls into two $A(l)$'s but not the other one, and so $C(\mathbf{y})$ contains two classes. For example, if $\mathbf{y}$ falls into $A(1)$ and $A(2)$ but not $A(3)$, then $C(\mathbf{y}) = \{1, 2\}$, i.e., $\mathbf{y}$ is classified as belonging to possibly class 1 or 2. (c) $\mathbf{y}$ falls into all three $A(l)$'s, i.e., $C(\mathbf{y}) = \{1, 2, 3\}$, and so $\mathbf{y}$ is classified as belonging to possibly all three classes. (d) $\mathbf{y}$ falls outside all the $A(l)$'s, i.e., $C(\mathbf{y})$ is empty, and so $\mathbf{y}$ is classified as not belonging to any one of the three classes. There is nothing wrong with this last classification since this $\mathbf{y}$ is judged not to be from the class l by the acceptance set $A(l)$ for each l, though such a $\mathbf{y}$ must be rare in order to guarantee the property in Equation (3). On the other hand, since it is known that $\mathbf{y}$ is from one of the k classes, it is sensible to classify such a $\mathbf{y}$ according to any reasonable classifier, e.g., the Bayesian classifier illustrated in the next paragraph. As the confidence set resulting from this augmentation contains $C(\mathbf{y})$, the property in Equation (3) clearly still holds for the augmented confidence set.
For example, if $\mathbf{y}$ is the point marked by '*' in Figure 1, then $C(\mathbf{y})$ is empty. The Bayesian classifier and the augmented confidence set can be worked out in the following way. Assume a non-informative prior $P(c = l) = 1/k$ about the class c of $\mathbf{y}$; then the posterior probability of $\mathbf{y}$ belonging to class l is given by
$$P(c = l \mid \mathbf{y}) = \frac{f_l(\mathbf{y})}{\sum_{i=1}^{k} f_i(\mathbf{y})},$$
where $f_l$ is the probability density function of $N_p(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ and the denominator is proportional to the marginal density of $\mathbf{y}$ and so does not depend on l. Hence, the Bayesian classifier classifies $\mathbf{y}$ to the class that maximizes $f_l(\mathbf{y})$ over $l \in C$. For the $\mathbf{y}$ marked by '*', the posterior probability of class 2 is the largest of the three. Hence, the Bayesian classifier and the augmented confidence set classify $\mathbf{y}$ to class 2.
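A sketch of this augmentation step in R (ours, not the authors' code) is given below; it reuses the hypothetical conf_set_known() above and the dmvnorm() density from the mvtnorm package.

library(mvtnorm)

classify_augmented <- function(y, mus, Sigmas, alpha = 0.05) {
  cs <- conf_set_known(y, mus, Sigmas, alpha)
  if (length(cs) > 0) return(cs)   # non-empty C(y): keep it as the confidence set
  # empty C(y): fall back to the Bayesian classifier; under a uniform prior
  # the posterior probability is proportional to the class density f_l(y)
  dens <- mapply(function(mu, Sig) dmvnorm(y, mean = mu, sigma = Sig),
                 mus, Sigmas)
  which.max(dens)
}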
In this particular example, $A(1)$ and $A(3)$ do not intersect, as seen from Figure 1, and so no future $\mathbf{y}$ will be classified as being in both classes 1 and 3. This reflects the fact that the distributions of classes 1 and 3 are quite different/separated and so easy to distinguish. On the other hand, the distributions of classes 2 and 3 are similar and so hard to distinguish. As a result, $A(2)$ and $A(3)$ have a large overlap and hence many future $\mathbf{y}$'s will be classified as belonging to both classes 2 and 3.
5. An Illustrative Example
The famous iris data set introduced by [41] is used in this section to illustrate the method proposed in this paper. The data set is simple but nevertheless serves the purpose of illustration. It contains $k = 3$ classes representing the three species of Iris flowers (setosa, versicolor, virginica), with fifty observations from each class in the training data set $T$. Each observation gives the measurements (in centimetres) of four variables: sepal length and width, and petal length and width. The iris data set can be found in ([42], Chapter 10), for example, and is also in the R base package.
First, we assume that only the first two measurements, sepal length and width, are used for classification, in order to illustrate the method easily, since the acceptance sets are then two-dimensional and so can be easily plotted. Based on the fifty observations from each of the three classes, one can calculate the sample mean vectors and sample covariance matrices of the three classes. In the example in Section 3, these are used as the known values of the means and covariance matrices for the three classes. For the pre-specified probability and proportion levels, the critical constant in Equation (7) is computed by our R program to be 9.175 using the chosen simulation numbers S and Q. The confidence set in (4) is based on the acceptance sets plotted in Figure 2 as the ellipsoidal regions centred at the sample means, marked by '+', $l = 1, 2, 3$. These ellipsoidal regions are larger than, but have the same centres and shapes as, the corresponding ellipsoidal regions given in Figure 1 of Section 3. This reflects the fact that the underlying multivariate normal distributions have been estimated from the training data $T$ in this case and so involve uncertainty, while the distributions in Section 3 are assumed to be known.
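The sample statistics, and the resulting confidence set for a new flower, can be sketched in R as follows. This is our illustrative code, not the paper's program: the critical constant 9.175 is simply taken from the text above (computing it would require the simulation-based algorithm of Section 4), and we assume the acceptance sets take the same Mahalanobis-distance form as in Section 3 with the estimated parameters.

X   <- iris[, 1:2]                # sepal length and sepal width
cls <- split(X, iris$Species)
xbar <- lapply(cls, colMeans)     # sample mean vector of each class
Shat <- lapply(cls, cov)          # sample covariance matrix of each class

conf_set_iris <- function(y, lambda = 9.175) {  # 9.175: critical constant above
  d <- mapply(function(m, S) mahalanobis(y, m, S), xbar, Shat)
  names(d)[d <= lambda]   # classes whose estimated acceptance set contains y
}

conf_set_iris(c(5.0, 3.5))  # a hypothetical new flower; likely returns "setosa"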
The index l is an element of the confidence set in Equation (4) if and only if $\mathbf{y}$ falls into the $l$th acceptance set plotted in Figure 2. Hence, the following four situations can occur, similar to those in Section 3. (a) $\mathbf{y}$ falls into only one acceptance set and so the confidence set has only one class. (b) $\mathbf{y}$ falls into two of the acceptance sets but not the other one, and so the confidence set contains two classes. (c) $\mathbf{y}$ falls into all three acceptance sets and so $\mathbf{y}$ is classified as belonging to possibly all three classes. (d) $\mathbf{y}$ falls outside all the acceptance sets, the confidence set is empty, and so $\mathbf{y}$ is classified as not belonging to any one of the three classes.
From Figure 2, it is clear that the three acceptance sets have a non-empty common intersection, and so any future $\mathbf{y}$ that falls into this intersection has confidence set $\{1, 2, 3\}$; that is, such a $\mathbf{y}$ is judged to be possibly from any of the three classes.
As in Section 3, if $\mathbf{y}$ does not belong to any of the acceptance sets, we compute the augmented confidence set by using, for example, the naive Bayesian classifier with a non-informative prior, which classifies $\mathbf{y}$ to the class whose estimated density at $\mathbf{y}$ is the largest, where the density of the $l$th class is the multivariate normal density with the mean vector and covariance matrix replaced by their estimates from the training data.
To get some idea of how sensitive the critical constant is to the simulation numbers S and Q, we have computed it for various combinations of S and Q on an ordinary Windows PC (Intel Core(TM)2 Duo CPU). As larger values of S and Q are expected to produce a more accurate value of the critical constant, the results given in Table 1 indicate that the value based on moderate simulation numbers, in comparison with the value based on the largest S and Q considered, is accurate to at least the first decimal place and so is probably sufficiently accurate for most real problems.
Alternatively, one can compute several values of the critical constant for the given S and Q using different random seeds, to assess the accuracy of a computed value. For example, fourteen values based on fourteen different random seeds are computed to be 9.231, 9.188, 9.172, 9.223, 9.192, 9.178, 9.203, 9.191, 9.198, 9.225, 9.182, 9.189, 9.224 and 9.181, which form a sample of observations from the population distribution of all possible values of the critical constant. This sample can then be used to make inference about the population and, in particular, the standard deviation of the population, which measures the variability (or accuracy) of one value from the population. The mean and standard deviation of this sample of fourteen observations are 9.198 and 0.020, respectively, and so a computed value is expected to be within the range $9.198 \pm 3 \times 0.020 = (9.138, 9.258)$ using the "three-sigma" rule.
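These summary figures are easy to verify in R:

vals <- c(9.231, 9.188, 9.172, 9.223, 9.192, 9.178, 9.203,
          9.191, 9.198, 9.225, 9.182, 9.189, 9.224, 9.181)
mean(vals)                          # approximately 9.198
sd(vals)                            # approximately 0.020
mean(vals) + c(-3, 3) * sd(vals)    # "three-sigma" range, about (9.14, 9.26)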
It is also worth emphasizing that only one critical constant needs to be computed from the observed training data set, which is then used for the classification of all future objects. Hence, one can always increase S and Q to achieve better accuracy of the critical constant as required, and computation time should not be of great concern.
If all four measurements are used in classification, then $p = 4$ and the acceptance sets are four-dimensional ellipsoids and so cannot be drawn. Nevertheless, the confidence set in Equation (4) is still valid and can be computed easily for a given $\mathbf{y}$. For the pre-specified probability and proportion levels, the critical constant in Equation (8) is computed by our R program using the chosen simulation numbers S and Q. Now, suppose a future Iris flower has its four measurements observed. Then, it is easy to check that this $\mathbf{y}$ falls into the acceptance set for class 1, since the corresponding quadratic-form statistic does not exceed the critical constant, while it falls outside the acceptance sets for both classes 2 and 3. Hence, the confidence set in (4) is $\{1\}$; that is, this Iris flower is classified as from class 1, i.e., setosa.
6. A Simulation Study
In this section, a simulation study is carried out to illustrate the desirable feature of the confidence-set-based classifier (CS) proposed in this paper, and to highlight its differences from the following popular classifiers: classification tree (CT, implemented using the R package tree), multinomial logistic regression (MLR, implemented using the R package nnet), support vector machine (SVM, implemented using the R package e1071) and naive Bayes (NB, implemented using the R package e1071). The setting of $k = 3$ classes and $p = 2$ feature measurements is considered, following the illustrative example in the last section.
Three configurations of the $k = 3$ classes are considered in the simulation study: CONF1, CONF2 and CONF3, each specifying three bivariate normal distributions in terms of the means and covariance matrices given in the example in Section 3. CONF1 represents the situation in which all three classes are quite similar and thus hard to distinguish. CONF2 represents the situation in which two of the classes (classes 2 and 3) are quite similar but quite different from the other class (class 1). In CONF3, all three classes are quite different and thus relatively easy to distinguish in comparison with CONF1 and CONF2.
For each configuration of the three population distributions, a random sample is generated from each class/distribution to form the training data set, which is then used to train the classifiers CS, CT, MLR, SVM and NB. Each classifier is then used to classify 3000 future objects, with 1000 generated from each of the three classes/distributions; the proportion of correct classification, P, of the 3000 objects is recorded. For CS, the average size M of the confidence sets for the 3000 objects is also recorded; note that all the other classifiers classify each future object into only one class. This process is repeated 100 times to produce 100 values of P for each classifier, and 100 values of M for CS only. The averages of these values, denoted by $\bar{P}$ for each classifier and $\bar{M}$ for CS, are given in Table 2, with the corresponding standard deviations given in brackets. One can download from [40] our R computer program SimulationStudyF.R that implements this simulation study.
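For concreteness, the following R skeleton (ours, not SimulationStudyF.R; the distribution parameters and the critical constant lambda must be supplied) shows one replication of this process for the CS classifier, under the assumed Mahalanobis-distance form of the acceptance sets.

library(MASS)   # for mvrnorm()

one_replication <- function(mus, Sigmas, n_train, lambda, n_future = 1000) {
  k <- length(mus)
  # training data: estimate each class mean and covariance matrix
  train <- lapply(1:k, function(l) mvrnorm(n_train, mus[[l]], Sigmas[[l]]))
  xbar <- lapply(train, colMeans)
  Shat <- lapply(train, cov)
  correct <- 0; total_size <- 0; N <- k * n_future
  for (l in 1:k) {              # future objects whose true class is l
    Y <- mvrnorm(n_future, mus[[l]], Sigmas[[l]])
    for (j in 1:n_future) {
      d  <- mapply(function(m, S) mahalanobis(Y[j, ], m, S), xbar, Shat)
      cs <- which(d <= lambda)  # confidence set for this future object
      correct    <- correct + (l %in% cs)
      total_size <- total_size + length(cs)
    }
  }
  c(P = correct / N, M = total_size / N)  # proportion correct, average size
}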
Due to the property in Inequality (9) of CS, one expects that, for CS, the proportion of the 100 simulation runs in which P attains the pre-specified level is large. This is indeed the case for each of the three configurations from the results in Table 2. Note, however, that this proportion is either equal or close to zero for all the other classifiers. This is the advantage of CS, by construction, over the other classifiers. To guarantee the property in Inequality (9), the size of the confidence set may be larger than one, as indicated by the $\bar{M}$ values in Table 2, while all the other classifiers select only one class for each future object. The average size of the confidence set depends on the configuration of the three classes. As expected, $\bar{M}$ tends to be smaller when the classes are easier to distinguish, and larger when the classes are harder to distinguish. For example, CONF3 has a considerably smaller $\bar{M}$ than CONF1.
As CS has the property in Inequality (9), it is not surprising that its $\bar{P}$ is likely to be larger than the pre-specified proportion of correct classification, which is borne out by the results in Table 2. However, for the other classifiers, the value of $\bar{P}$ depends on how different the three classes are; $\bar{P}$ tends to be larger when the classes are more different and thus easier to distinguish. For example, CONF3 has a larger $\bar{P}$ than CONF1.
7. Conclusions
This paper considers how to deal with the classification problem using the novel confidence set approach, by adapting the idea of [19,20] for inference about the predictor values of the observed response values in a standard linear regression model. Specifically, confidence sets for the true classes of infinitely many future objects, based on one training data set, have been constructed so that, with a pre-specified confidence level about the randomness in the training data set, the proportion of the confidence sets that contain the true classes is at least the pre-specified proportion.
The intuitive motivation underlying this method is that, when an object is judged to be possibly from several classes, we should accept this objectively rather than forcing ourselves to pick just one class, which entails a large chance of misclassification. By allowing an object to be classified as possibly from more than one class, the proportion of correct classification can be guaranteed to be at least the pre-specified proportion, with a large probability with respect to the randomness in the training data set. This 'guaranteed probability about the randomness in the training data set' should be intuitive too, since a training data set that is very misleading about the k classes will likely produce a classifier that makes many wrong classifications, and so only the well-behaved training data sets, which occur with the guaranteed probability, will produce a classifier that gives at least the guaranteed proportion of correct future classifications.
The two sources of randomness, that in the training data set and that in the future objects, have been treated differently to reflect the fact that a classifier is built from one training data set and then used to classify many future objects. If the two sources of randomness are treated on an equal footing, then the confidence set in Equation (10) should be used, which has a very different coverage frequency interpretation.
In this paper, the feature measurements of the objects from each class are assumed to follow a multivariate normal distribution. How the proposed method can be generalized to, or may be affected by, non-normal distributions, such as the elliptically contoured distributions ([38], p. 47), is interesting and warrants further research.
A frequentist approach is proposed in this paper. One wonders whether a corresponding Bayesian approach is easier to construct. In a Bayesian approach, one uses the posterior distribution of the class of a future object to make inference about its true class. In particular, one can easily construct a Bayesian credible set for the true class of each future object with a pre-specified posterior probability. However, it is not at all clear whether this construction guarantees the required proportion of the credible sets containing the true classes, since it can be shown that the posterior distributions of the true classes of two different future objects are not independent. Nevertheless, the Bayesian approach warrants further research.