Confidence Sets for Statistical Classification

Wei Liu; Frank Bretz; Natchalee Srimaneekarn; Jianan Peng; Anthony J. Hayter

doi:10.3390/stats2030024

Abstract

Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. In statistical terms, classification is inference about the unknown parameters, i.e., the true classes of future objects. Hence, various standard statistical approaches can be used, such as point estimators, confidence sets and decision theoretic approaches. For example, a classifier that classifies a future object as belonging to only one of several known classes is a point estimator. The purpose of this paper is to propose a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. An example is provided to illustrate the method, and a simulation study is included to highlight the desirable feature of the method.

Keywords:

classification; confidence level; confidence set; coverage frequency; simultaneous tolerance intervals, statistical inference

1. Introduction

Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences, among others. See, e.g., the recent books by [1,2,3,4]. Classical examples include medical diagnosis, automatic character recognition, data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis) and artificial intelligence (such as the development of machines with brain-like performance). As many important developments in this area are not confined to the statistics literature, various other names, such as supervised learning, pattern recognition and machine learning, have been used. In recent years, there have been many exciting new developments in both methodology and applications, taking advantage of increased computational power readily available nowadays. Broadly speaking, classification methods can be divided into probabilistic methods (including Bayesian classifiers), regression methods (including logistic regression and regression trees), geometric methods (including support vector machines), and ensemble methods (combining classifiers for improved robustness).

A classifier is a decision rule built from a training data set

T

that classifies all future objects as belonging to one or several of the k known classes, where k is a pre-specified number. The drawback of a classifier that classifies each future object into only one of k classes is that, when the object is close to the classification boundaries of several classes,

M (\geq 2)

say, the chance of misclassification is close to

(M - 1) / M

, which may be close to one when M is large. A sensible approach in this situation is to acknowledge that such an object has similar chances of belonging to M classes and hence to avoid classifying it into only one of the M classes. In medical diagnosis, for example, if there is not enough evidence to classify a patient as having a disease or not, then it is wise not to give a diagnosis that is quite likely to be wrong.

Various procedures have been proposed in the literature to deal with this difficulty. One type of procedure allows a rejection option, that is, if a future object falls into a ‘rejection’ region, then no classification is made for the object. Such a procedure aims to construct a suitable rejection region to minimize a pre-specified risk; see, e.g., [5,6,7,8,9] and the references therein. Non-deterministic classifiers are proposed in [10], which allow a future object to be classified possibly into several classes. Again, such a classifier is constructed to minimize a pre-specified risk.

For the binary classification problem (i.e.,

k = 2

), ref. [11] proposes to find two ‘tolerance’ regions (corresponding to the two classes) in the feature/predictor space, with a specific coverage level for each class that minimize the probability that an object falls into the intersection of the two tolerance regions since an object in this intersection will not be classified. This approach is akin to the decision-theoretic approaches mentioned in the last paragraph but uses this specific probability as the risk to minimize. As with other decision-theoretic approaches, it is not constructed to guarantee the proportion of correction classification and thus is different from the approach proposed in this paper. Further development of this approach is considered in [12].

The conformal prediction approach of [13,14] also classifies a future object into possibly several classes that contain the true class with a pre-specified probability. However, this approach is designed for the ‘online’ setting in which the true classes of all the observed objects are revealed and hence known before the classification of the next object is made. This online setting is different from the usual setting of classification considered in this paper, in which a classifier is built from the available training data set

T

and then used to classify a large number of future objects without knowing their true classes.

For the binary classification problem, ref. [15] proposes a classifier that allows no classification of an object. By controlling the size of the non-classification region (for which classification error does not occur) via a tuning constant, a ‘generalized error’ of the classifier is controlled at a pre-specified level with a specified confidence about the randomness in the training data set

T

. The construction of this classifier is related to the tolerance sets going back to [16]. Note, however, that the algorithm in [15] may result in a different classifier if a different observation in the training data set is used as the ‘base’ instance in the algorithm, which is quite odd from a statistical point of view. In addition, the ‘generalized error’ is different from the long run frequency of correct classification, which the procedure proposed in this paper aims to control.

The purpose of this paper is to propose a classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. Specifically, classification of a future object is treated as a standard problem of statistical inference about the unknown parameter c, the true class of the object, and the confidence set approach for c is adopted. In order to consider the probability of correct classification, it is necessary to assume certain probability distributions for the feature measurements from the k classes. In this paper the feature measurements of the k classes are assumed to follow multivariate normal distributions, which is widely used either directly or after some transformation (see [17,18]).

The layout of the paper is as follows: Section 2 contains some preliminaries, including the idea of [19,20] from which the new approach proposed in this paper is developed. The simple situation where the means

μ_{i}

and covariance matrices

Σ_{i}

of the k multivariate normal distributions underlying the k classes are assumed to be known is considered in Section 3. The more realistic situation where both

μ_{i}

and

Σ_{i}

are unknown parameters is studied in Section 4. Section 5 provides an illustrative example. A simulation study is given in Section 6 to highlight the major advantage of the new classifier proposed in this paper. Section 7 provides the conclusions. Finally, some mathematical details are provided in the Appendix A.

2. Preliminaries

Let the p-dimensional data vector

x_{i} = {(x_{i 1}, \dots, x_{i p})}^{T}

denote the feature measurement on an object from the ith class, which has multivariate normal distribution

N (μ_{i}, Σ_{i})

,

i = 1, \dots, k

. The training data set is given by

T = {x_{i 1}, \dots, x_{i n_{i}}; i = 1, \dots, k}

, where

x_{i 1}, \dots, x_{i n_{i}}

are i.i.d. observations from the ith class with distribution

N (μ_{i}, Σ_{i})

,

i = 1, \dots, k

. The classification problem is to make inference about c, the true class of a future object, based on the feature measurement

y = {(y_{1}, \dots, y_{p})}^{T}

observed on the object, which is only known to belong to one of the k classes and so follows one of the k multivariate normal distributions. In statistical terminology, c is the unknown parameter of interest that takes a possible value in the simple parameter space

C = {1, \dots, k}

. We emphasize that c is treated as non-random in our frequentist approach.

A classifier that classifies an object with measurement

y

into one single class in

C = {1, \dots, k}

can be regarded as a point estimator of c. The classifier proposed in this paper provides a set

C_{T} (y) \subseteq C

as plausible values of c. Depending on

y

and the training data set

T

,

C_{T} (y)

may contain only a single value, in which case

y

is classified into one single class given by

C_{T} (y)

. When

C_{T} (y)

contains more than one value in C,

y

is classified as possibly belonging to the several classes given by

C_{T} (y)

. Hence, in statistical terms, the classifier proposed in this paper uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed

1 - α

proportion of confidence sets that contain the true classes.

The confidence set for c is constructed below by inverting a family of acceptance sets for testing

H_{0} : c = l

for each

l \in C

. This method of constructing a confidence set was given by [21] and has been used and generalized to construct numerous intriguing confidence sets; see, e.g., [22,23,24,25,26,27,28,29]

Now, the key idea of [19,20] is presented very briefly, which is crucial for understanding our proposed approach to classification. Assuming that response y and predictor x are related by a standard linear regression model

y = α_{0} + α_{1} x + ϵ

and a training data set

T

on

(y, x)

is available for estimating

α_{0}, α_{1}

and the error variance

σ^{2}

, refs. [19,20] consider how to construct confidence sets for the unknown (non-random) values of the predictor x corresponding to the large number of future observed values of the response y. As the same training data set

T

is used in the construction of all these confidence sets, the randomness in the future y-values and the randomness in

T

clearly play different roles and thus should be treated differently. The procedure proposed in [19,20] has a probability of at least

γ

, with respect to the randomness in

T

that at least

1 - α

proportion of all the confidence sets, constructed from the same

T

, include the true x-values, where

γ

and

1 - α

are pre-specfied probabilities. This idea/approach has been studied by many researchers; see, e.g., [30,31,32,33,34,35,36] and the references therein. One fundamental result is that the confidence sets constructed from

(γ, 1 - α)

simultaneous tolerance intervals do satisfy the ‘

γ

-probability-

(1 - α)

’-proportion property specified above. In particular, ref. [36] points out by constructing a counter example that the confidence sets constructed from

(γ, 1 - α)

pointwise tolerance intervals do not guarantee the ‘

γ

-probability-

(1 - α)

-proportion’ property in general. A similar idea is also used in [37] to construct confidence sets for the numbers of coins in all future bags with known weights.

Since a classifier is built from the training data set

T

and then used to classify a large number of future objects in terms of confidence sets for their true classes, the future observed y-values play similar roles as the future observed y-values whilst the unknown true classes of the future objects play similar roles as the unknown true x-values of the future observed y-values, in the approach of [19,20] given in the last paragraph. Hence, it is natural to adopt the approach of [19,20] to construct confidence sets for the unknown true classes of future objects with the ‘

γ

-probability-

(1 - α)

-proportion’ property, that is, the probability, with respect to the randomness in

T

, is at least

γ

that at least

(1 - α)

proportion of all the confidence sets constructed from the same

T

do include the unknown true classes of all future objects.

3. Known $μ_{i}$ and $Σ_{i}$

In this section, the values of

μ_{i}

and

Σ_{i}

are assumed to be known, which helps to motivate and understand the confidence sets constructed in Section 4 for the more realistic situation where the values of

μ_{i}

and

Σ_{i}

are unknown. Since

μ_{i}

and

Σ_{i}

are known, no training data set

T

is required to estimate

μ_{i}

and

Σ_{i}

. Hence, the confidence sets in this section are denoted as

C (y)

, without the subscript

T

.

If

y

is from the lth class, then

y \sim N (μ_{l}, Σ_{l})

and so

{(y - μ_{l})}^{T} Σ_{l}^{- 1} (y - μ_{l})

has the chi-square distribution

χ_{p}^{2}

with p degrees of freedom. We construct a

1 - α

confidence set for the class c of the observed

y

by using [21] method of inverting a family of

1 - α

acceptance sets for testing

H_{0} : c - l

for each number l in C. Specifically, the acceptance set for

H_{0} : c - l

is given by

A_{l} = \{y \in R^{p} : {(y - μ_{l})}^{T} Σ_{l}^{- 1} (y - μ_{l}) \leq λ\},

(1)

where

λ = χ_{p, 1 - α}^{2}

is the

1 - α

quantile of the

χ_{p}^{2}

distribution. It follows directly from Neyman’s method that the confidence set is given by

C (y) = \{l \in C : {(y - μ_{l})}^{T} Σ_{l}^{- 1} (y - μ_{l}) \leq λ\} .

(2)

It is straightforward to show, by using the Neyman–Pearson lemma, that the acceptance set

A_{l}

in Equation (1) is optimal in terms of having the smallest volume among all the

1 - α

acceptance sets for testing

H_{0} : c - l

.

As for the usual confidence sets, it is desirable that, among the confidence sets

C (y_{1}), C (y_{2}), \dots

for the corresponding unknown true classes

c_{1}, c_{2}, \dots \in C

of the infinitely many future

y_{j}

with distribution

N (μ_{c_{j}}, Σ_{c_{j}})

(

j = 1, 2, \dots

), at least

1 - α

proportion will contain the true

c_{j}

’s. That is, it is desirable that

\underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C (y_{j})\}} \geq 1 - α,

(3)

where

I_{A}

denotes the indicator function of set A and so

\frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C (y_{j})\}}

is the proportion among the N confidence sets

C (y_{j})

that contains the true classes

c_{j}

. It is shown in the Appendix A that the property in Equation (3) holds with equality.

The interpretation of the property in Equation (3) is similar to that of a standard confidence set. The noteworthy difference is that the confidence sets

C (y_{j})

are for possibly different parameters

c_{j}

(

j = 1, 2, \dots

). In addition, note that, for each j,

C (y_{j})

is a standard

1 - α

level confidence set for

c_{j}

, with

y_{j}

being the only source of randomness.

Figure 1 gives an illustrative example with

k = 3

,

p = 2

:

μ_{1} = (\begin{matrix} 5.01 \\ 3.43 \end{matrix}), μ_{2} = (\begin{matrix} 5.94 \\ 2.77 \end{matrix}), μ_{3} = (\begin{matrix} 6.59 \\ 2.97 \end{matrix}),

Σ_{1} = (\begin{matrix} 0.124, & 0.99 \\ 0.99, & 0.144 \end{matrix}), Σ_{2} = (\begin{matrix} 0.266, & 0.085 \\ 0.085, & 0.098 \end{matrix}), Σ_{3} = (\begin{matrix} 0.404, & 0.094 \\ 0.094, & 0.104 \end{matrix}),

and

α = 5 %

(and so

λ = χ_{2, 0.95}^{2} = 5.991

). Specifically, the acceptance set

A_{l}

in Equation (1) is represented in Figure 1 by the ellipsoidal region centred at

μ_{l}

, marked by ’+’,

l = 1, 2, 3

. If

y \in A_{l}

then l is an element of the confidence set

C (y)

given in Equation (1). Hence, the following four situations can occur. (a)

y

falls into only one

A_{l}

and so

C (y)

has a single class. For example, if

y \in A_{1} \cap A_{2}^{c} \cap A_{3}^{c}

, then

C (y) = {1}

, i.e.,

y

is classified as belonging to class 1. (b)

y

falls into two

A_{l}

’s but not the other one, and so

C (y)

contains two classes. For example, if

y \in A_{1} \cap A_{2} \cap A_{3}^{c}

, then

C (y) = {1, 2}

, i.e.,

y

is classified as belonging to possibly classes 1 or 2. (c)

y

falls into all the three

A_{l}

’s, i.e.,

y \in A_{1} \cap A_{2} \cap A_{3}

, and so

y

is classified as belonging to possibly all three classes. (d)

y

falls outside all the

A_{l}

’s, i.e.,

y \in A_{1}^{c} \cap A_{2}^{c} \cap A_{3}^{c}

, and so

C (y) = \emptyset

and

y

is classified as not belonging to any one of the three classes. There is nothing wrong with this last classification since this

y

is judged not to be from the class l by the acceptance set

A_{l}

for each l, though such a

y

must be rare in order to guarantee the property in Equation (3). On the other hand, since it is known that

y

is from one of the k classes, it is sensible to classify

y

according to any reasonable classifier, e.g., a Bayesian classifier illustrated in the next paragraph. As the resultant confidence set

C^{*} (y)

from this augmentation contains

C (y)

, the property in Equation (3) clearly still holds for

C^{*} (y)

.

Figure 1. The acceptance sets for the three classes with known

μ_{i}

and

Σ_{i}

.

For example, if

y = {(4.5, 2.0)}^{T}

, which is marked by ‘*’ in Figure 1, then

C (y) = \emptyset

. The Bayesian classifier and the augmented confidence set

C^{*} (y)

can be worked out in the following way. Assume a non-informative prior

π (1) = π (2) = π (3) = 1 / 3

about the class c of

y

, then the posterior probability of

y

belonging to class l is given by

p (l | y) = π (l) f (y, l) / p (y) = f (y, l) / [3 p (y)],

where

f (y, l)

is the probability density function of

N (μ_{l}, Σ_{l})

and

p (y)

is the marginal density of

y

and so does not depend on l. Hence, the Bayesian classifier classifies

y

to the class

i_{0}

that satisfies

f (y, i_{0}) = {max}_{l \in C} f (y, l)

. For

y = {(4.5, 2.0)}^{T}

, we have

f (y, 1) = 0.009

,

f (y, 2) = 0.612

and

f (y, 3) = 0.046

. Hence, the Bayesian classifier and the augmented confidence set

C^{*} (y)

classify

y

to class 2.

In this particular example,

A_{1}

and

A_{3}

do not intersect as seen from Figure 1 and so any future

y

will not be classified to be in both classes 1 and 3. This reflects the fact that the distributions of the classes 1 and 3 are quite different/separated and so easy to distinguish. On the other hand, the distributions of the classes 2 and 3 are similar and so hard to distinguish. As a result,

A_{2}

and

A_{3}

have a large overlap and hence many future

y

’s will be classified as belonging to both classes 2 and 3.

4. Unknown $μ_{i}$ and $Σ_{i}$

4.1. Methodology

Now, we consider the more realistic situation where both the values of

μ_{i}

and

Σ_{i}

are unknown and so need to be estimated from the training data set

T

, independent of the future observations

y_{j}

(

j = 1, 2, \dots

) whose classes

c_{j}

are unknown and need to be inferred.

The training data set

T = {x_{i 1}, \dots, x_{i n_{i}}; i = 1, \dots, k}

can be used to estimate

μ_{i}

and

Σ_{i}

in the usual way:

\hat{μ_{i}} = \frac{1}{n_{i}} \sum_{m = 1}^{n_{i}} x_{i m}

,

{\hat{Σ}}_{i} = \frac{1}{n_{i} - 1} \sum_{m = 1}^{n_{i}} (x_{i m} - \hat{μ_{i}}) {(x_{i m} - \hat{μ_{i}})}^{T}, i = 1, \dots, k

. It is known [38] that

\hat{μ_{i}} \sim N (μ_{i}, Σ_{i} / n_{i})

,

(n_{i} - 1) {\hat{Σ}}_{i} = \sum_{m = 1}^{n_{i} - 1} z_{i m} z_{i m}^{T}

with

z_{i 1}, \dots, z_{i (n_{i} - 1)}

being i.i.d.

N (0, Σ_{i})

random vectors independent of

\hat{μ_{i}}

.

Mimicking the confidence set in Equation (2), we construct the confidence set for the class c of

y

as:

C_{T} (y) = \{l \in C : {(y - \hat{μ_{l}})}^{T} {\hat{Σ}}_{l}^{- 1} (y - \hat{μ_{l}}) \leq λ\},

(4)

where

λ

is a suitably chosen critical constant whose determination is considered next.

As in Section 3, it is desirable that the proportion of the future confidence sets

C_{T} (y_{j})

(

j = 1, 2, \dots

) that include the true classes

c_{j}

(

j = 1, 2, \dots

) should be at least

1 - α

:

\underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq 1 - α .

(5)

It is shown in the Appendix A that a sufficient condition for guaranteeing Inequality (5) is

inf_{c_{j} \in C} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq 1 - α,

(6)

where

E_{y_{j} | T}

denotes the conditional expectation with respect to the random variable

y_{j}

conditioning on the training data set

T

(or, equivalently,

{({\hat{μ}}_{1}, {\hat{Σ}}_{1}), \dots, ({\hat{μ}}_{k}, {\hat{Σ}}_{k})}

).

Since the value of the expression on the left-hand side of the inequality in Inequality (6) depends on

T

and

T

is random, Inequality (6) cannot be guaranteed for each observed

T

; more detailed explanation on this is given in the Appendix A. We therefore guarantee Inequality (6) with a large (close to 1) probability

γ

with respect to the randomness in

T

, which is shown in the Appendix A to be equivalent to

P_{T} \{min_{1 \leq l \leq k} P_{w_{l} | u_{l}, {v_{l m}}} \{{(w_{l} - u_{l})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m} v_{l m}^{T})}^{- 1} (w_{l} - u_{l}) \leq λ\} \geq 1 - α\} \geq γ,

(7)

where

w_{l} \sim N (0, I_{p}), u_{l} \sim N (0, I_{p} / n_{l}), v_{l m} \sim N (0, I_{p}), m = 1, \dots, n_{l} - 1

(8)

and all the

w_{l}

’s,

u_{l}

’s and

v_{l m}

’s are independent. This in turn guarantees that

P_{T} \{\underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq 1 - α\} \geq γ .

(9)

The interpretation of this statement is that, based on one observed training data set

T

, one constructs confidence sets

C_{T} (y_{j})

for the

c_{j}

’s of all future

y_{j}

(

j = 1, 2, \dots

) and claims that at least

1 - α

proportion of these confidence sets do contain the true

c_{j}

’s. Then, we are

γ

confident with respect to the randomness in the training data set

T

that the claim is correct.

It is noteworthy that for the classification problem considered in this paper a classifier is built from one training data set

T

and then used to classify a large number of future

y_{j}

’s. Hence, the randomness in both the training data set

T

and the future

y_{j}

’s need to be accounted for but in different ways. This is reflected in our approach by the two numbers

1 - α

and

γ

, analogous to the idea of [19,20] as pointed out in Section 2.

If we treat the two sources of randomness in

y

and

T

simultaneously on equal footing (instead of the approach given above), then it is straightforward to show that ([38], Section 5.2)

{(y - {\hat{μ}}_{c})}^{T} {\hat{Σ}}_{c}^{- 1} (y - {\hat{μ}}_{c}) \sim \frac{(n_{c} + 1) (n_{c} - 1) p}{n_{c} (n_{c} - p)} F_{p, n_{c} - p},

where c is the true class of

y

, and

F_{p, n_{c} - p}

denotes an F random variable with degrees of freedom p and

n_{c} - p

. It follows therefore from Neyman’s method that

C (T, y) = \{l \in C : {(y - {\hat{μ}}_{l})}^{T} {\hat{Σ}}_{l}^{- 1} (y - {\hat{μ}}_{l}) \leq \frac{(n_{l} + 1) (n_{l} - 1) p}{n_{l} (n_{l} - p)} f_{p, n_{l} - p, 1 - α}\}

(10)

is a

1 - α

confidence set for c, where

f_{p, n_{l} - p, 1 - α}

is the

1 - α

quantile of

F_{p, n_{l} - p}

. However, this confidence set has the following coverage frequency interpretation. Collect one training data set

T

and the feature

y

of one future object, both of which are then used to compute the confidence set

C (T, y)

for the class c of

y

; then, the frequency of the confidence sets that contain the true c’s is

1 - α

among a large number of confidence sets constructed in this way. Note that, in this construction, one training data set

T

is used only once with one future

y

to produce one confidence set

C (T, y)

, and so the randomness in one

T

and the randomness in one future

y

are treated on equal footing. This is clearly different from what is considered in this paper and how statistical classification is used in most applications: only one training data set

T

is used to construct a classifier, which is then used repeatedly in classification of a large number of future objects with observed

y

values. Hence, our proposed new method treats the two sources of randomness in

T

and future

y

’s differently.

4.2. Algorithm for Computing $λ$

We now consider how to compute the critical constant

λ

so that the probability

P_{T}

in Equation (7) is equal to

γ

. This is accomplished by simulation in the following way. From the distributions given in Equation (8), in the sth repeat of simulation,

s = 1, \dots, S

, generate independent

u_{l}^{s} \sim N (0, I_{p} / n_{l}), v_{l 1}^{s}, \dots, v_{l (n_{l} - 1)}^{s} \sim N (0, I_{p}); l = 1, \dots, k

and find the

λ = λ_{s}

so that

min_{1 \leq l \leq k} P_{w_{l} | u_{l}^{s}, {v_{l m}^{s}}} \{{(w_{l} - u_{l}^{s})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m}^{s} {v_{l m}^{s}}^{T})}^{- 1} (w_{l} - u_{l}^{s}) \leq λ_{s}\} = 1 - α .

(11)

Repeat this S times to get

λ_{1}, \dots, λ_{S}

and order these as

λ_{[1]} \leq \dots \leq λ_{[S]}

. It is well known [39] that

λ_{[γ S]}

converges to the required critical constant

λ

with probability one as

S \to \infty

. Hence,

λ_{[γ S]}

is used as the required critical constant

λ

for a large S value, 10,000 say.

To find the

λ_{s}

in Equation (11) for each s, we also use simulation in the following way. Generate independent random vectors

{w_{l q} : q = 1, \dots, Q; l = 1, \dots, k}

from

N (0, I_{p})

, where Q is the number of simulations for finding

λ_{s}

. For each l, denote

t_{l q}^{s} = {(w_{l q} - u_{l}^{s})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m}^{s} {v_{l m}^{s}}^{T})}^{- 1} (w_{l q} - u_{l}^{s}), q = 1 \dots, Q

and their ordered values as

t_{l [1]}^{s} \leq \dots \leq t_{l [Q]}^{s}

. Then, it is clear that

t_{l [(1 - α) Q]}^{s}

is the sample

1 - α

quantile of

{(w_{l} - u_{l}^{s})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m}^{s} {v_{l m}^{s}}^{T})}^{- 1} (w_{l} - u_{l}^{s})

in which only

w_{l}

is random, and so

t_{l [(1 - α) Q]}^{s}

converges to the population

1 - α

quantile

λ_{s l}

with probability one as

Q \to \infty

, where

λ_{s l}

satisfies

P_{w_{l} | u_{l}^{s}, {v_{l m}^{s}}} \{{(w_{l} - u_{l}^{s})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m}^{s} {v_{l m}^{s}}^{T})}^{- 1} (w_{l} - u_{l}^{s}) \leq λ_{s l}\} = 1 - α .

Hence,

{max}_{1 \leq l \leq k} t_{l [(1 - α) Q]}^{s}

converges to

{max}_{1 \leq l \leq k} λ_{s l} = λ_{s}

as

Q \to \infty

and is used as an approximation to

λ_{s}

for a large Q value, 10,000 say.

It is noteworthy that

λ

depends only on

γ, α, p, k, n_{1}, \dots, n_{k}

(and the numbers of simulations S and Q which determine the numerical accuracy of

λ

due to simulation randomness). One can download from [40] our R computer program ConfidenceSetClassifier.R that implements this simulation method of computing the critical constant

λ

. While it is expected that larger values of S and Q will produce a more accurate

λ

value, it must be pointed out that there is no easy way to assess how the accuracy of

λ

depends on the values of S and Q. One practical way is to compute several

λ

values using different random seeds in the simulation for given S and Q, which form a random sample from the population of possible

λ

values. These

λ

values provide information on the variability among the possible

λ

values produced by the simulation method, and so accuracy of

λ

due to simulation randomness. See more details in Section 5.

As in Section 3, the confidence set

C_{T} (y)

in Equation (4) may be empty for a

y

and so

y

is classified as not belonging to any of the c classes. As discussed in Section 3, there is nothing wrong with this, but it is sensible to classify such a

y

according to any reasonable classifier. The resultant confidence set

C_{T}^{*} (y)

from this augmentation contains

C_{T} (y)

, and so Inequality (9) still holds for

C_{T}^{*} (y)

.

5. An Illustrative Example

The famous iris data set introduced by [41] is used in this section to illustrate the method proposed in this paper. The data set is simple but serves the purpose of illustration nevertheless. It contains

k = 3

classes representing the three species/classes of Iris flowers (

1 =

setosa,

2 =

versicolor,

3 =

virginica), and has

n_{i} = 50

observations from each class in

T

. Each observation gives the measurements (in centimetres) of the four variables: sepal length and width, and petal length and width. The data set iris can be found in ([42], Chapter 10) for example, and is also in the R base package.

First, we assume that only the first two measurements, sepal length and width, are used for classification in order to easily illustrate the method since the acceptance sets

A_{l}, l = 1, 2, 3

are two-dimensional and so can be easily plotted in this case. Based on the fifty observations on

p = 2

measurements from each of the three classes, one can calculate that

\begin{matrix} {\hat{μ}}_{1} = (\begin{matrix} 5.01 \\ 3.43 \end{matrix}), {\hat{μ}}_{2} = (\begin{matrix} 5.94 \\ 2.77 \end{matrix}), {\hat{μ}}_{3} = (\begin{matrix} 6.59 \\ 2.97 \end{matrix}), \\ {\hat{Σ}}_{1} = (\begin{matrix} 0.124, & 0.099 \\ 0.099, & 0.144 \end{matrix}), {\hat{Σ}}_{2} = (\begin{matrix} 0.266, & 0.085 \\ 0.085, & 0.098 \end{matrix}) and {\hat{Σ}}_{3} = (\begin{matrix} 0.404, & 0.094 \\ 0.094, & 0.104 \end{matrix}) . \end{matrix}

In the example in Section 3, these are used as the known values of

μ_{i}

and

Σ_{i}

for the three classes. For

α = 5 %

and

γ = 95 %

, the critical constant

λ

in Equation (7) is computed by our R program to be 9.175 using

S = 10,000

and

Q = 10,000

. The confidence set

C_{T} (y)

in (4) is based on the acceptance sets

A_{l} = \{y \in R^{p} : {(y - {\hat{μ}}_{l})}^{T} {\hat{Σ}}_{l}^{- 1} (y - {\hat{μ}}_{l}) \leq λ\}, l = 1, 2, 3

, which are plotted in Figure 2 by the ellipsoidal region centred at

{\hat{μ}}_{l}

, marked by ‘+’,

l = 1, 2, 3

. These ellipsoidal regions are larger than, but have the same centers and shapes as, the corresponding ellipsoidal regions given in Figure 1 of Section 3. This reflects the fact that the underlying multivariate normal distributions have been estimated from the training data

T

in this case and so involve uncertainty, while the distributions in Section 3 are assumed to be known.

Figure 2. The acceptance sets for the three classes with estimated

μ_{i}

and

Σ_{i}

.

The index l is an element of the confidence set

C_{T} (y)

in Equation (3) if and only if

y \in A_{l}

. Hence, the following four situations can occur, similar to those in Section 3. (a)

y

falls into only one

A_{l}

and so

C_{T} (y)

has only one class. (b)

y

falls into two

A_{l}

’s but not the other one, and so

C_{T} (y)

contains two classes. (c)

y

falls into all the three

A_{l}

’s, i.e.,

y \in A_{1} \cap A_{2} \cap A_{3}

, and so

y

is classified as belonging to possibly all three classes. (d)

y

falls outside all the

A_{l}

’s, i.e.,

y \in A_{1}^{c} \cap A_{2}^{c} \cap A_{3}^{c}

, and so

C_{T} (y) = \emptyset

and

y

is classified as not belonging to any one of the three classes.

From Figure 2, it is clear that

A_{1} \cap A_{2} \cap A_{3} \neq \emptyset

and so for any future

y \in A_{1} \cap A_{2} \cap A_{3}

the confidence set

C_{T} (y) = {1, 2, 3}

that is,

y

is judged to be possibly from any of the three classes.

As in Section 3, if

y

does not belong to any

A_{l}, l = 1, 2, 3

, we compute the augmented confidence set

C_{T}^{*} (y)

by using, for example, the naive Bayesian classifier with a non-informative prior that classifies

y

to the class

i_{0}

that satisfies

\hat{f} (y, i_{0}) = {max}_{l \in C} \hat{f} (y, l)

, where

\hat{f} (\cdot, l)

is the multivariate normal density function of the lth class with

μ_{l}

and

Σ_{l}

replaced by the estimates

{\hat{μ}}_{l}

and

{\hat{Σ}}_{l}

, respectively.

To get some idea of how sensitive the critical constant

λ

is to the simulation numbers S and Q, we have computed

λ

for various

(S, Q)

with

γ = 0.95, α = 0.05, p = 2, k = 3

and

n_{1} = n_{2} = n_{3} = 50

on an ordinary Windows PC (Core (TM2) Due CPU P8400@2.26 GHz ). As it is expected that larger values of S and Q will produce more accurate

λ

value, the results given in Table 1 indicate that the

λ

value based on

(S, Q) = (10,000, 10,000)

, in comparison with the

λ

value based on

(S, Q) = (20,000, 20,000)

, is accurate to at least the first decimal place and so probably sufficiently accurate for most real problems.

Table 1. Constant

λ

and computation time

C T

for various

(S, Q)

.

Alternatively, one can compute several

λ

values for the given S and Q values using different random seeds to assess the accuracy of a

λ

value computed. For example, fourteen

λ

values based on

(S, Q) = (10,000, 10,000)

based on fourteen different random seeds are computed to be: 9.231, 9.188, 9.172, 9.223, 9.192, 9.178, 9.203, 9.191, 9.198, 9.225, 9.182, 9.189, 9.224, 9.181, which form a sample of observations from the population distribution of all possible values of

λ

. This sample can then be used to infer the population and, in particular, the standard deviation of the population which gives the variability (or accuracy) of one

λ

value from the population. The mean and standard deviation of this sample of fourteen observations are given by

9.198

and

0.0196

, respectively, and so the

λ

value based on

(S, Q) = (10,000, 10,000)

is expected to be within the range

9.198 \pm 3 \times 0.0196

using the “three-sigma” rule.

It is also worth emphasizing that only one

λ

needs to be computed based on the observed training dataset

T

which is then used for classifications of all future objects. Hence, one can always increase S and Q to achieve better accuracy of

λ

as required and computation time should not be of a great concern.

If all the four measurements are used in classification, then

p = 4

and the acceptance sets

A_{l} = \{y \in R^{p} : {(y - {\hat{μ}}_{l})}^{T} {\hat{Σ}}_{l}^{- 1} (y - {\hat{μ}}_{l}) \leq λ\}, l = 1, 2, 3

are four dimensional ellipsoidal balls and so cannot be drawn. Nevertheless, the confidence set

C_{T} (y)

in Equation (4) is still valid and can be computed easily for a given

y

. For

α = 5 %, γ = 95 %, p = 4, k = 3

and

n_{1} = n_{2} = n_{3} = 50

, the critical constant

λ

in Equation (8) is computed by our R program to be

λ = 14.367

using

S = 10,000

and

Q = 10,000

. Now, suppose a future Iris flower has measurements

y = (4.5, 3.5, 1.4, 0.27)

. Then, it is easy to check that

y \in A_{1}

since

{(y - {\hat{μ}}_{1})}^{T} {\hat{Σ}}_{1}^{- 1} (y - {\hat{μ}}_{1}) = 5.915 \leq λ

, while

y \notin A_{2} \cup A_{3}

since

{(y - {\hat{μ}}_{l})}^{T} {\hat{Σ}}_{l}^{- 1} (y - {\hat{μ}}_{l}) > λ

for both

l = 2

and 3. Hence, the confidence set

C_{T} (y)

in (4) is

{1}

, that is, this Iris flower is classified as from class 1, i.e., Setosa.

6. A Simulation Study

In this section, a simulation study is carried out to illustrate the desirable feature of the confidence-set based classifier (CS) proposed in this paper, and to highlight its differences from the following popular classifiers: classification tree (CT, implemented using R package tree), multinomial logistic regression (MLR, implemented using R package nnet), support vector machine (SVM, implemented using R package e1071) and naive Bayes (NB, implemented using R package e1071). The setting

k = 3, p = 2, n_{1} = n_{2} = n_{3} = 50, γ = 0.95

and

α = 0.05

is considered following the illustrative example in the last section.

Three configurations of the

k = 3

classes are considered in the simulation study. For the

μ_{i}

and

Σ_{i}

given in the example in Section 3, the first configuration (CONF1) has the

k = 3

normal distributions given by

N (μ_{1} + {(0, - 0.5)}^{T}, Σ_{1})

,

N (μ_{2}, Σ_{2})

and

N (μ_{3}, Σ_{3})

. The second configuration (CONF2) has the distributions

N (μ_{1}, Σ_{1})

,

N (μ_{2}, Σ_{2})

and

N (μ_{3}, Σ_{3})

. The third configuration (CONF3) has the distributions

N (μ_{1}, Σ_{1})

,

N (μ_{2}, Σ_{2})

and

N (μ_{3} + {(1.0, 0.5)}^{T}, Σ_{3})

. CONF1 represents the situation that all the

k = 3

classes are quite similar and thus hard to distinguish. CONF2 represents the situation that two of the

k = 3

classes (i.e., classes 2 and 3) are quite similar but quite different from the other class (i.e., class 1). In CONF3, all the

k = 3

classes are quite different and thus relatively easy to distinguish in comparison to CONF1 and CONF2.

For each configuration of the three population distributions, a random sample of size

n_{i} = 50

is generated from each class/distribution to form the training data set

T

which is then used to train the classifiers CS, CT, MLR, SVM and NB. Each classifier is then used to classify

N = 3000

future objects, with 1000 generated from each of the three classes/distributions; the proportion of correct classification,

ζ

, of the

N = 3000

objects is recorded. For CS, the average size M of the confidence sets for the

N = 3000

objects is also recorded; note that all the other classifiers classify each future object to only one class. This process is repeated for 100 times to produce

ζ_{1}, \dots, ζ_{100}

for each classifier, and

M_{1}, \dots, M_{100}

for CS only. Denote

\hat{γ} = \frac{1}{100} \sum_{i = 1}^{100} I_{{ζ_{i} \geq 1 - α}}

and

\bar{ζ} = \frac{1}{100} \sum_{i = 1}^{100} ζ_{i}

for each classifier, and

\bar{M} = \frac{1}{100} \sum_{i = 1}^{100} M_{i}

for CS. The results on

\hat{γ}

,

\bar{ζ}

and

\bar{M}

are given in Table 2, with the corresponding standard deviations given in brackets. One can download from [40] our R computer program SimulationStudyF.R that implements this simulation study.

Table 2. Simulation results. Abbreviations are defined in the text.

Due to the property in Inequality (9) of CS, one expects that

\hat{γ} \geq γ = 0.95

for CS. This is indeed the case for each of the three configurations from the results in Table 2. Note, however, that

\hat{γ}

is either equal or close to zero for all the other classifiers. This is the advantage of CS, by construction, over the other classifiers. To guarantee the property in Inequality (9), the size of the confidence set may be larger than one as indicated by the

\bar{M}

values in Table 2, while all the other classifiers select only one class for each future object. The average size of the confidence set depends on the configuration of the

k = 3

classes. As expected,

\bar{M}

tends to be smaller when the

k = 3

classes are easier to distinguish, but larger when the

k = 3

classes are harder to distinguish. For example, CONF3 has a considerably smaller

\bar{M}

than CONF1.

As CS has the property in Inequality (9), it is not surprising that

\bar{ζ}

is likely to be larger than

1 - α = 0.95

, which is born out by the results in Table 2. However, for the other classifiers, the value of

\bar{ζ}

depends on how different the

k = 3

classes are;

\bar{ζ}

tends to be larger when the

k = 3

classes are more different and thus easier to distinguish. For example, CONF3 has a larger

\bar{ζ}

than CONF1.

7. Conclusions

This paper considers how to deal with the classification problem using the novel confidence set approach by adapting the idea of [19,20] for inference about the predictor values of the observed response values in a standard linear regression model. Specifically, confidence sets

C_{T} (y_{j})

for the true classes

c_{j}

of infinitely many future objects

y_{j}

(

j = 1, 2, \dots

), based on one training data set

T

, have been constructed so that, with confidence level

γ

about the randomness in

T

, the proportion of the

C_{T} (y_{j})

’s that contain the true

c_{j}

’s is at least

1 - α

.

The intuitive motivation underlying this method is that, when an object is judged to be possibly from several classes, we should accept this objectively rather than forcing ourselves to pick just one class, which entails a large chance of misclassification. By allowing an object to be classified as possibly from more than one class, the proportion of correct classification can be guaranteed to be at least

1 - α

with a large probability

γ

about the randomness in the training data set

T

. This ‘guaranteed probability

γ

about the randomness in

T

’ should be intuitive too since a

T

that is very misleading about the k classes will likely produce a classifier that makes many wrong classifications, and so only

γ

proportion of well behaved

T

will produce a classifier that give at least

1 - α

future correct classifications.

The two sources of randomness, those in the training data

T

and in future objects

y_{j}

, have been treated differently to reflect the fact that a classifier is built from one training data set

T

and then used to classify many future objects

y_{j}

. If the two sources of randomness are treated on equal footing, then the confidence set in Equation (10) should be used, which has a very different coverage frequency interpretation.

In this paper, the objects

y

from each class are assumed to follow a multivariate normal distribution. How the proposed method can be generalized to, or may be affected by, non-normal distributions, such as the elliptically contoured distribution [38] (p. 47) is interesting and warrants further research.

A frequentist approach is proposed in this paper. One wonders whether a corresponding Bayesian approach is easier to construct. In a Bayesian approach, one uses the posterior distribution

π (c_{j} | y_{j}, T)

to make an inference about the true class

c_{j}

of the future object

y_{j}

. In particular, one can easily construct a Bayesian credible set

C_{B} (y_{j})

for

c_{j}

such that

P {c_{j} \in C_{B} (y_{j}) | y_{j}, T} \geq 1 - α

. However, it is not at all clear whether this construction guarantees that

\underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C_{B} (y_{j})\}} \geq 1 - α

since it can be shown that

π (c_{i}, c_{j} | y_{i}, y_{j}, T) \neq π (c_{i} | y_{i}, T) π (c_{j} | y_{j}, T)

, i.e., the posterior distributions of

c_{i}

and

c_{j}

for the two future objects

y_{i}

and

y_{j}

are not independent. Nevertheless, Bayesian approach warrants further research.

Author Contributions

W.L., F.B., N.S., J.P., and A.J.H. all contributed to the writing of the paper, the data analysis, and the implementation of the simulations study.

Funding

The authors declare no funding.

Acknowledgments

We would like to thank the referees for critical and constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Mathematical Details

In this appendix, we first show that the property in Equation (3) holds with equality. Note that we have

\begin{matrix} lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C (y_{j})\}} \\ = & lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} P \{c_{j} \in C (y_{j})\} \\ = & lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} P \{{(y_{j} - μ_{c_{j}})}^{T} Σ_{c_{j}}^{- 1} (y_{j} - μ_{c_{j}}) \leq λ\} \\ = & lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} (1 - α) = 1 - α, \end{matrix}

where the first equality above follows from the classical strong law of large numbers ([43], p. 333), the second from the definition of

C (y_{j})

in Equation (2), and the third from

λ = χ_{p, 1 - α}^{2}

. This completes the proof.

Next, we show that

inf_{c_{j} \in C} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq 1 - α implies \underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq 1 - α,

(A1)

where

E_{y_{j} | T}

denotes the conditional expectation with respect to the random variable

y_{j}

conditioning on the training data set

T

(or, equivalently, all the

{\hat{μ}}_{i}

and

{\hat{Σ}}_{i}

). We have from the classical strong law of large numbers [43] that

lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} [I_{\{c_{j} \in C_{T} (y_{j})\}} - E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}}] = 0,

in which the conditional expectation

E_{y_{j} | T}

is used since all the confidence sets

C_{T} (y_{j})

(

j = 1, 2, \dots

) use the same training data set

T

. Hence,

\begin{matrix} \underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} I_{\{c_{j} \in C_{T} (y_{j})\}} \\ = & lim_{N \to \infty} \frac{1}{N} \sum_{j = 1}^{N} [I_{\{c_{j} \in C_{T} (y_{j})\}} - E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}}] + \underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} \\ = & \underset{N \to \infty}{lim inf} \frac{1}{N} \sum_{j = 1}^{N} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} . \end{matrix}

The required result in Expression (12) now follows immediately from

\frac{1}{N} \sum_{j = 1}^{N} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} \geq inf_{c_{j} \in C} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}}

for any

N \geq 1

since it is known that all the

c_{j}

’s are in C. This completes the proof.

Next, we provide a more tractable expression for

{inf}_{c_{j} \in C} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}}

in order to understand why Inequality (6) cannot be guaranteed for each observed

T

. From the definition of

C_{T} (y)

in Equation (4), we have

\begin{matrix} inf_{c_{j} \in C} E_{y_{j} | T} I_{\{c_{j} \in C_{T} (y_{j})\}} \\ = & inf_{c_{j} \in C} P_{y_{j} | T} \{c_{j} \in C_{T} (y_{j})\} \\ = & inf_{c_{j} \in C} P_{y_{j} | T} \{{(y_{j} - {\hat{μ}}_{c_{j}})}^{T} {\hat{Σ}}_{c_{j}}^{- 1} (y_{j} - {\hat{μ}}_{c_{j}}) \leq λ\} \\ = & min_{1 \leq l \leq k} P_{w_{l} | u_{l}, {v_{l m}}} \{{(w_{l} - u_{l})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m} v_{l m}^{T})}^{- 1} (w_{l} - u_{l}) \leq λ\}, \end{matrix}

(A2)

where

\begin{matrix} w_{l} = Σ_{l}^{- 1 / 2} (y_{l} - μ_{c_{l}}) \sim N (0, I_{p}), \end{matrix}

(A3)

\begin{matrix} u_{l} = Σ_{l}^{- 1 / 2} ({\hat{μ}}_{l} - μ_{c_{l}}) \sim N (0, I_{p} / n_{l}), \end{matrix}

(A4)

\begin{matrix} v_{l m} = Σ_{l}^{- 1 / 2} z_{l m} \sim N (0, I_{p}), m = 1, \dots, n_{l} - 1, \end{matrix}

(A5)

with all the

w_{l}

’s,

u_{l}

’s and

v_{l m}

’s being independent. Note that

w_{l}

depends on the future observation

y_{l}

but not the training data set

T

, while

u_{l}

and

{v_{l m}}

depend on the training data set

T

but not the future observations.

Since the conditional probability in Equation (13) depends on the training data set

T

(via the random vectors

u_{l}

and

{v_{l m}}

), Inequality (6), for any given value of

λ

, cannot be guaranteed for each observed training data set

T

, i.e.,

u_{l}

and

{v_{l m}}

. For example, if the values of

T

are such that

{(w_{l} - u_{l})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m} v_{l m}^{T})}^{- 1} (w_{l} - u_{l})

is substantially larger than

λ

(for a given constant

λ

) for most possible values of

w_{l} \sim N (0, I_{p})

, then the conditional probability in Equation (13) is smaller than

1 / 2

and hence

1 - α \in (1 / 2, 1)

.

We therefore guarantee Inequality (6) with a large (close to 1) probability

γ

with respect to the randomness in

T

, which is, from Equation (13), clearly equivalent to

P_{T} \{min_{1 \leq l \leq k} P_{w_{l} | u_{l}, {v_{l m}}} \{{(w_{l} - u_{l})}^{T} {(\frac{1}{n_{l} - 1} \sum_{m = 1}^{n_{l} - 1} v_{l m} v_{l m}^{T})}^{- 1} (w_{l} - u_{l}) \leq λ\} \geq 1 - α\} \geq γ,

(A6)

where

w_{l}

,

u_{l}

and

v_{l m}

are given in Equations (14)–(16).

References

Webb, A.R.; Copsey, K.D. Statistical Pattern Recognition, 3rd ed.; Wiley: New York, NY, USA, 2011. [Google Scholar]
Flach, P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 4th ed.; Academic Press: Cambridge, MA, USA, 2009. [Google Scholar]
Piegorsch, W.W. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery; Wiley: New York, NY, USA, 2015. [Google Scholar]
Chow, C.K. On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theor. 1970, 16, 41–46. [Google Scholar] [CrossRef]
Bartlett, P.L.; Wegkamp, M.H. Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 2008, 9, 1823–1840. [Google Scholar]
Yuan, M.; Wegkamp, M. Classification methods with reject option based on convex risk minimization. J. Mach. Learn. Res. 2010, 11, 111–130. [Google Scholar]
Yu, H.; Jeske, D.R.; Ruegger, P.; Borneman, J. A three-class neutral zone classifier using a decision—Theoretic approach with application to dan array analyses. J. Agric. Biolog. Environ. Stat. 2010, 15, 474–490. [Google Scholar] [CrossRef] [PubMed]
Ramaswamy, H.G.; Tewari, A.; Agarwal, S. Consistent Algorithms for Multiclass Classification with a Reject Option. Available online: https://arxiv.org/abs/1505.04137 (accessed on 24 June 2019).
Del Coz, J.J.; Diez, J.; Bahamonde, A. Learning nondeterministic classifiers. J. Mach. Learn. Res. 2009, 10, 2293–2293. [Google Scholar]
Lei, J. Classification with confidence. Biometrika 2014, 101, 755–769. [Google Scholar] [CrossRef]
Sadinle, M.; Lei, J.; Wasserman, L. Least Ambiguous Set-Valued Classifiers with Bounded Error Levels. Available online: https://arxiv.org/abs/1609.00451 (accessed on 24 June 2019).
Shafer, G.; Vovk, V. A tutorial on conformal prediction. J. Mach. Learn. Res. 2008, 9, 371–421. [Google Scholar]
Vovk, V.; Gammerman, A.; Shafer, G. Algorithm Learning in a Random World; Springer: Heidelberg, Germany, 2005. [Google Scholar]
Campi, M.C. Classification with guarnateed probability of error. Mach. Learn. 2010, 80, 63–84. [Google Scholar] [CrossRef]
Wilks, S.S. Determination of sample sizes for setting tolerance limits. Ann. Math. Stat. 1941, 14, 45–55. [Google Scholar] [CrossRef]
Giles, P.J.; Kipling, D. Normality of oligonucleotide microarray data and implications for parametric statistical analyses. Bioinformatics 2003, 19, 2254–2262. [Google Scholar] [CrossRef] [PubMed]
Hoyle, D.D.; Rattray, M.; Jupp, R.; Brass, A. Making sense of microarray data distributions. Bioinformatics 2002, 18, 576–584. [Google Scholar] [CrossRef] [PubMed]
Lieberman, G.J.; Miller, R.G., Jr. Simultaneous tolerance intervals in regression. Biometrika 1963, 50, 155–168. [Google Scholar] [CrossRef]
Lieberman, G.J.; Miller, R.G., Jr.; Hamilton, M.A. Simultaneous discrimination intervals in regression. Biometrika 1967, 54, 133–145. [Google Scholar] [CrossRef] [PubMed]
Neyman, J. Outline of a theory of statistical estimation based on the classical theory of probability. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Sci. 1937, 236, 333–380. [Google Scholar] [CrossRef]
Hayter, A.J.; Hsu, J.C. On the relationship between stepwise decision procedures and confidence sets. J. Am. Stat. Assoc. 1994, 89, 128–136. [Google Scholar] [CrossRef]
Lehmann, E.L. Testing Statistical Hypotheses, 2nd ed.; Wiley: New York, NY, USA, 1986. [Google Scholar]
Finner, H.; Strassburge, K. The partitioning principle: A powerful tool in multiple decision theory. Ann. Stat. 2002, 30, 1194–1213. [Google Scholar] [CrossRef]
Huang, Y.; Hsu, J.C. Hochberg’s step-up method: Cutting corners off Holm’s step-down method. Biometrika 2007, 94, 965–975. [Google Scholar] [CrossRef]
Uusipaikka, E. Confidence Intervals in Generalized Regression Models; CRC Press: Boca Raton, FL, USA, 2008. [Google Scholar]
Ferrari, D.; Yang, Y. Confidence sets for model selection by f-testing. Stat. Sin. 2015, 25, 1637–1658. [Google Scholar] [CrossRef]
Wan, F.; Liu, W.; Bretz, F.; Han, Y. An exact confidence set for a maximum point of a univariate polynomial function in a given interval. Technometrics 2015, 57, 559–565. [Google Scholar] [CrossRef]
Wan, F.; Liu, W.; Bretz, F.; Han, Y. Confidence sets for optimal factor levels of a response surface. Biometrics 2016, 72, 1285–1293. [Google Scholar] [CrossRef]
Scheffé, H. A statistical theory of calibration. Ann. Stat. 1973, 1, 1–37. [Google Scholar] [CrossRef]
Mee, R.W.; Eberhardt, K.R.; Reeve, C.P. Calibration and simultaneous tolerance intervals for regression. Technometrics 1991, 33, 211–219. [Google Scholar] [CrossRef]
Mathew, T.; Zha, W. Multiple use confidence regions in mul- tivariate calibration. J. Am. Stat. Assoc. 1997, 92, 1141–1150. [Google Scholar] [CrossRef]
Mathew, T.; Sharma, M.K.; Nordstrom, K. Tolerance regions and multiple-use confidence regions in multivariate calibration. Ann. Stat. 1998, 26, 1989–2013. [Google Scholar] [CrossRef]
Aitchison, T.C. Discussion of the paper ‘Multivariate Calibration’ by Brown. J. R. Stat. Soc. Ser. B (Methodol.) 1982, 44, 309–360. [Google Scholar]
Krishnamoorthy, K.; Mathew, T. Statistical Tolerance Regions: Theory, Applications and Computation; Wiley: New York, NY, USA, 2009. [Google Scholar]
Han, Y.; Liu, W.; Bretz, F.; Wan, F.; Yang, P. Statistical calibration and exact one-sided simultaneous tolerance intervals for polynomial regression. J. Stat. Plan. Inference 2016, 168, 90–96. [Google Scholar] [CrossRef]
Liu, W.; Han, Y.; Bretz, F.; Wan, F.; Yang, P. Counting by weighing: Know your numbers with confidence. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2016, 65, 641–648. [Google Scholar] [CrossRef]
Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: New York, NY, USA, 2003. [Google Scholar]
Serfling, R. Approximation Theorems of Mathematical Statistics; Wiley: New York, NY, USA, 1980. [Google Scholar]
Liu, W. R Computer Programs. Available online: http://www.personal.soton.ac.uk/wl/Classification/ (accessed on 24 June 2019).
Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
Martinez, W.L.; Martinez, A.R. Computational Statistics Handbook with Matlab, 2nd ed.; Chapman & Hall: Boca Raton, FL, USA, 2008. [Google Scholar]
Chow, Y.S.; Teicher, H. Probability Theory: Independence, Interchangeability, Martingales; Springer: Heidelberg, Germany, 1978. [Google Scholar]

Figure 1. The acceptance sets for the three classes with known

μ_{i}

and

Σ_{i}

.

Figure 1. The acceptance sets for the three classes with known

μ_{i}

and

Σ_{i}

.

Figure 2. The acceptance sets for the three classes with estimated

μ_{i}

and

Σ_{i}

.

Figure 2. The acceptance sets for the three classes with estimated

μ_{i}

and

Σ_{i}

.

Table 1. Constant

λ

and computation time

C T

for various

(S, Q)

.

Table 1. Constant

λ

and computation time

C T

for various

(S, Q)

.

$(S, Q)$	(1000, 1000)	(10,000, 10,000)	(10,000, 20,000)	(10,000, 50,000)	(20,000, 10,000)	(20,000, 20,000)
$λ$	9.289	9.175	9.199	9.204	9.187	9.198
$C T$	15 min	25 h	51 h	139 h	58 h	102 h

Table 2. Simulation results. Abbreviations are defined in the text.

	CS			CT		MLR		SVM		NB
	$\hat{γ}$	$\bar{ζ}$	$\bar{M}$	$\hat{γ}$	$\bar{ζ}$	$\hat{γ}$	$\bar{ζ}$	$\hat{γ}$	$\bar{ζ}$	$\hat{γ}$	$\bar{ζ}$
CONF1	1.00	0.98	2.06	0.00	0.71	0.00	0.77	0.00	0.76	0.00	0.74
		(0.0066)	(0.1075)		(0.0258)		(0.0099)		(0.0121)		(0.0110)
CONF2	1.00	0.98	1.67	0.00	0.74	0.00	0.80	0.00	0.80	0.00	0.79
		(0.0062)	(0.0552)		(0.0254)		(0.0085)		(0.0090)		(0.0095)
CONF3	1.00	0.98	1.31	0.00	0.90	0.10	0.94	0.07	0.94	0.01	0.94
		(0.0060)	(0.0716)		(0.0159)		(0.0052)		(0.0058)		(0.0068)

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.

Confidence Sets for Statistical Classification

Abstract

1. Introduction

2. Preliminaries

3. Known μ i and Σ i

4. Unknown μ i and Σ i

4.1. Methodology

4.2. Algorithm for Computing λ

5. An Illustrative Example

6. A Simulation Study

7. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Mathematical Details

References

Article Metrics

Article Access Statistics

3. Known $μ_{i}$ and $Σ_{i}$

4. Unknown $μ_{i}$ and $Σ_{i}$

4.2. Algorithm for Computing $λ$