2.1. Methodology
Let the $p$-dimensional data vector $Y$ denote the feature measurement on an object from the $l$th class, which has the multivariate normal distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$; here, $k$ denotes the total number of classes, which is a known number. The available training dataset is given by $T = \{X_{l1}, \ldots, X_{ln_l} : l = 1, \ldots, k\}$, where $X_{l1}, \ldots, X_{ln_l}$ are i.i.d. observations from the $l$th class with distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The classification problem is to make inference about $c$, the true class of a future object, based on the feature measurement $Y$ observed on the object, which is only known to belong to one of the $k$ classes and so follows one of the $k$ multivariate normal distributions. In statistical terminology, $c$ is the unknown parameter of interest that takes a possible value in the simple parameter space $\{1, \ldots, k\}$. We emphasize that $c$ is treated as non-random both in the work of Liu et al. (2019) and here.
A classifier that classifies an object with measurement $Y$ into one single class in $\{1, \ldots, k\}$ can be regarded as a point estimator of $c$. The classifier of Liu et al. (2019) provides a set $C(Y) \subseteq \{1, \ldots, k\}$ as plausible values of $c$. Depending on $Y$ and the training dataset $T$, $C(Y)$ may contain only a single value, in which case $Y$ is classified into the single class given by $C(Y)$. When $C(Y)$ contains more than one value, $Y$ is classified as possibly belonging to the several classes given by $C(Y)$. Hence, in statistical terms, the classifier $C(Y)$ uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed proportion of confidence sets that contain the true classes.
Specifically, the set $C(Y)$ was constructed by Liu et al. (2019) as
$$C(Y) = \left\{ l \in \{1, \ldots, k\} : (Y - \bar{X}_l)^{\mathrm{T}} \hat{\Sigma}_l^{-1} (Y - \bar{X}_l) \le \lambda \right\}, \tag{1}$$
where $\bar{X}_l$ and $\hat{\Sigma}_l$, $l = 1, \ldots, k$, are, respectively, the usual estimators of the unknown $\mu_l$ and $\Sigma_l$ based on the training dataset $T$, and $\lambda$ is a suitably chosen critical constant whose determination is considered next. The intuition behind the definition of $C(Y)$ in Equation (1) is that a future object $Y$ is likely to be from class $l$ if and only if $(Y - \bar{X}_l)^{\mathrm{T}} \hat{\Sigma}_l^{-1} (Y - \bar{X}_l) \le \lambda$.
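To make the construction in Equation (1) concrete, the following R sketch computes $C(y)$ for an observed feature vector; the function name `conf_set` and the list-based layout of the training estimates are our own illustrative choices, and `lambda` is assumed to have been computed already.

```r
# Confidence set C(y) of Equation (1): the classes l whose quadratic-form
# distance from y, based on the usual estimators xbar_l and Sigma_hat_l,
# does not exceed the critical constant lambda.
conf_set <- function(y, xbar, Sigma_hat, lambda) {
  # xbar: list of k sample mean vectors; Sigma_hat: list of k sample covariances
  k <- length(xbar)
  dist2 <- sapply(1:k, function(l) {
    d <- y - xbar[[l]]
    as.numeric(t(d) %*% solve(Sigma_hat[[l]]) %*% d)
  })
  which(dist2 <= lambda)  # the set of plausible classes for y
}
```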
Note that the proportion of the future confidence sets $C(Y_i)$ ($i = 1, \ldots, N$) that include the true classes $c_i$ of $Y_i$ ($i = 1, \ldots, N$) is given by $\frac{1}{N} \sum_{i=1}^{N} I\{c_i \in C(Y_i)\}$. Thus, it is desirable that
$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} I\{c_i \in C(Y_i)\} \ge 1 - \alpha, \tag{2}$$
where $1 - \alpha$ is a pre-specified large (close to 1) proportion, e.g., 0.95. While the constraint in Equation (2) is difficult to deal with, Liu et al. (2019) showed that a sufficient condition for guaranteeing Equation (2) is
$$\min_{1 \le l \le k} P_{Y_l \mid T} \left\{ l \in C(Y_l) \right\} \ge 1 - \alpha, \tag{3}$$
where $Y_l \sim N_p(\mu_l, \Sigma_l)$ denotes the feature measurement of a future object from the $l$th class, and $P_{Y_l \mid T}$ denotes the conditional expectation with respect to the random variable $Y_l$ conditioning on the training dataset $T$ (or, equivalently, $\{\bar{X}_l, \hat{\Sigma}_l : l = 1, \ldots, k\}$).
Since the value of the expression on the left-hand side of the inequality in Equation (3) (and in Equation (2) as well) depends on $T$ and $T$ is random, the inequality in Equation (3) cannot be guaranteed for each observed $T$. We therefore guarantee Equation (3) with a large (close to 1) probability $1 - \gamma$ with respect to the randomness in $T$:
$$P_T \left\{ \min_{1 \le l \le k} P_{Y_l \mid T} \left\{ l \in C(Y_l) \right\} \ge 1 - \alpha \right\} = 1 - \gamma, \tag{4}$$
which in turn guarantees that
$$P_T \left\{ \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} I\{c_i \in C(Y_i)\} \ge 1 - \alpha \right\} \ge 1 - \gamma. \tag{5}$$
Computer code in R was provided by Liu et al. (2019) to compute the $\lambda$ that solves Equation (4), which allows the confidence sets $C(Y)$ in Equation (1) to be constructed for each future object.
The interpretation of Equations (5) and (6) below is that, based on one observed training dataset $T$, one constructs the confidence sets $C(Y_i)$ for the $c_i$s of all future $Y_i$ ($i = 1, \ldots, N$) and claims that at least a proportion $1 - \alpha$ of these confidence sets do contain the true $c_i$s. Then, we are $100(1 - \gamma)\%$ confident, with respect to the randomness in the training dataset $T$, that the claim is correct.
A natural question is how to find the exact critical constant $\lambda_e$ that solves the equation
$$P_T \left\{ \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} I\{c_i \in C(Y_i)\} \ge 1 - \alpha \right\} = 1 - \gamma, \tag{6}$$
which is an improvement on the conservative $\lambda$ that solves Equation (4) as given by Liu et al. (2019). Next, we show how to find the exact critical constant $\lambda_e$ under an additional assumption which is satisfied in some applications.
Assume that, among the $N$ future objects that need to be classified, $N_l$ objects are actually from the $l$th class with the distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The additional assumption we make is that
$$\lim_{N \to \infty} \frac{N_l}{N} = r_l, \quad l = 1, \ldots, k, \tag{7}$$
where the $r_l$s are assumed to be known constants in the interval $(0, 1)$. Intuitively, this assumption means that we know the proportions of the future objects that belong to each of the $k$ classes, even though we do not know the true class of each individual future object.
The assumption in Equation (7) is reasonable in some applications. For example, when screening for a particular disease among a specific population for a preventive purpose, there are $k = 2$ classes: having the disease ($l = 1$) or not having the disease ($l = 2$). If we know the prevalence of the disease, $d$, in the overall population, then $r_1 = d$ and $r_2 = 1 - d$, even though we do not know whether an individual subject has the disease or not.
It is shown in Appendix A that, under the assumption in Equation (7), Equation (6) is equivalent to
$$P_T \left\{ \sum_{l=1}^{k} r_l P_{Y_l \mid T} \left\{ (Y_l - \bar{X}_l)^{\mathrm{T}} \hat{\Sigma}_l^{-1} (Y_l - \bar{X}_l) \le \lambda \right\} \ge 1 - \alpha \right\} = 1 - \gamma, \tag{8}$$
where
$$Y_l \sim N_p(\mathbf{0}, I_p), \quad \bar{X}_l \sim N_p(\mathbf{0}, I_p / n_l), \quad \hat{\Sigma}_l \sim W_p(n_l - 1, I_p)/(n_l - 1), \quad l = 1, \ldots, k, \tag{9}$$
and all the $Y_l$s, $\bar{X}_l$s and $\hat{\Sigma}_l$s are independent; $P_{Y_l \mid T}$ denotes the conditional probability about $Y_l$ conditioning on $T = \{\bar{X}_l, \hat{\Sigma}_l : l = 1, \ldots, k\}$, and $P_T$ denotes the probability about $T$. Intuitively, conditional on $T$, the law of large numbers implies that the limiting proportion in Equation (6) equals the weighted sum $\sum_{l=1}^{k} r_l P_{Y_l \mid T}\{l \in C(Y_l)\}$, while the invariance of the quadratic form allows each $\mu_l$ and $\Sigma_l$ to be replaced by $\mathbf{0}$ and $I_p$ in Equation (9).
2.2. Algorithm for Computing the Exact $\lambda_e$
We now consider how to compute the critical constant $\lambda_e$ that solves Equation (8). Similar to Liu et al. (2019), this is accomplished by simulation in the following way. From the distributions given in Equation (9), in the $s$th repeat of the simulation, $s = 1, \ldots, S$, generate independent $\bar{X}_l^{(s)}$ and $\hat{\Sigma}_l^{(s)}$, $l = 1, \ldots, k$, and find the $\lambda_s$ so that
$$\sum_{l=1}^{k} r_l P_{Y_l} \left\{ (Y_l - \bar{X}_l^{(s)})^{\mathrm{T}} \big(\hat{\Sigma}_l^{(s)}\big)^{-1} (Y_l - \bar{X}_l^{(s)}) \le \lambda_s \right\} = 1 - \alpha. \tag{10}$$
Repeat this $S$ times to get $\lambda_1, \ldots, \lambda_S$, and order these as $\lambda_{(1)} \le \cdots \le \lambda_{(S)}$. It is well known (cf. [7]) that $\lambda_{(\lceil (1-\gamma)S \rceil)}$ converges to the required critical constant $\lambda_e$ with probability one as $S \to \infty$. Hence, $\lambda_{(\lceil (1-\gamma)S \rceil)}$ is used as the required critical constant $\lambda_e$ for a large $S$ value, e.g., $S = 10{,}000$.
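As an illustration of this outer loop, the following R sketch draws the estimators from the parameter-free distributions in Equation (9) and returns the $\lceil (1-\gamma)S \rceil$th order statistic. The helper `find_lambda_s()`, which solves Equation (10) for one simulated dataset, is sketched after the next paragraph; the function names and argument layout here are ours, not those of the published R code.

```r
library(MASS)  # for mvrnorm(); rWishart() is in the base stats package

# Outer simulation loop for the exact critical constant lambda_e of Equation (8).
# p: dimension; n: vector of the k training sample sizes (each n[l] - 1 >= p);
# r: vector of the known class proportions r_1, ..., r_k.
exact_lambda <- function(p, n, r, alpha, gamma, S = 10000, Q = 10000) {
  k <- length(n)
  lambda <- numeric(S)
  for (s in 1:S) {
    # Equation (9): standardized sample means and covariance estimators
    xbar <- lapply(1:k, function(l) mvrnorm(1, rep(0, p), diag(p) / n[l]))
    Sig  <- lapply(1:k, function(l)
      rWishart(1, df = n[l] - 1, Sigma = diag(p))[, , 1] / (n[l] - 1))
    lambda[s] <- find_lambda_s(xbar, Sig, r, alpha, Q, p)  # solves Equation (10)
  }
  sort(lambda)[ceiling((1 - gamma) * S)]  # the (1 - gamma) sample quantile
}
```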
To find the $\lambda_s$ in Equation (10) for each $s$, we use simulation in the following way. Generate independent random vectors $Y^{(1)}, \ldots, Y^{(Q)}$ from $N_p(\mathbf{0}, I_p)$, where $Q$ is the number of simulations for finding $\lambda_s$. For each given value of $\lambda_s$, the expression on the left side of Equation (10) can be computed by approximating each of the $k$ probabilities involved by the corresponding proportions out of the $Q$ simulations. It is also clear that this expression is monotone increasing in $\lambda_s$. Hence, the $\lambda_s$ that solves Equation (10) can be found by a search algorithm; for example, the bisection method is used in our R code. To approximate the probabilities by the proportions reasonably accurately, a large $Q$ value, e.g., $Q = 10{,}000$, should be used.
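A minimal sketch of this inner step, under the same illustrative naming, is given below; it reuses one set of $Q$ standard normal draws for all $k$ classes, which is consistent with the description above, and solves Equation (10) by bisection.

```r
# Inner simulation for the lambda_s of Equation (10): approximate each of the
# k conditional probabilities by proportions over Q draws of Y ~ N_p(0, I),
# then bisect on lambda_s (the left side is nondecreasing in lambda_s).
find_lambda_s <- function(xbar, Sig, r, alpha, Q, p) {
  k <- length(xbar)
  Y <- matrix(rnorm(Q * p), nrow = Q)  # Q independent N_p(0, I) vectors
  # Q x k matrix of quadratic forms (Y - xbar_l)' Sig_l^{-1} (Y - xbar_l)
  D <- sapply(1:k, function(l) {
    Yc <- sweep(Y, 2, xbar[[l]])       # center each draw at the simulated mean
    rowSums((Yc %*% solve(Sig[[l]])) * Yc)
  })
  lhs <- function(lambda) sum(r * colMeans(D <= lambda))  # left side of (10)
  lo <- 0; hi <- max(D)                # lhs(hi) = 1 >= 1 - alpha by construction
  for (iter in 1:60) {
    mid <- (lo + hi) / 2
    if (lhs(mid) < 1 - alpha) lo <- mid else hi <- mid
  }
  hi
}
```

With these two sketches, a call such as `exact_lambda(p = 2, n = c(50, 50), r = c(0.5, 0.5), alpha = 0.05, gamma = 0.05)` would return a simulation-based approximation of $\lambda_e$.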
It is noteworthy from Equations (8) and (9) that $\lambda_e$ depends only on $p$, $k$, $n_1, \ldots, n_k$, $r_1, \ldots, r_k$, $\alpha$ and $\gamma$ (and on the numbers of simulations $S$ and $Q$, which determine the numerical accuracy of $\lambda_e$ due to simulation randomness). It is also worth emphasizing that only one $\lambda_e$ needs to be computed for the observed training dataset $T$; it is then used for constructing the confidence sets $C(Y_i)$ and classifying accordingly all future objects $Y_i$.
It is expected that larger values of $S$ and $Q$ produce a more accurate value of $\lambda_e$; one can use the method discussed by Liu et al. (2019) to assess how the accuracy of $\lambda_e$ depends on the values of $S$ and $Q$. Similar to the work by Liu et al. (2019), it is recommended to set $S = 10{,}000$ and $Q = 10{,}000$ to balance computation time against the accuracy of $\lambda_e$ under simulation randomness.