Abstract
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. Liu et al. (2019) proposed a confidence-set-based classifier that classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into possibly more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects. However, the classifier uses a conservative critical constant. In this paper, we show how to determine the exact critical constant in applications where prior knowledge about the proportions of the future objects from each class is available. As the exact critical constant is smaller than the conservative critical constant given by Liu et al. (2019), the classifier using the exact critical constant is better than the classifier by Liu et al. (2019) as expected. An example is provided to illustrate the method.
1. Introduction
Classification has applications in a wide range of fields including medicine, engineering, computer science and social sciences among others. For overviews, the reader is referred to the books by [1,2,3,4,5]. In a recent paper, Liu et al. (2019) [6] proposed a new classifier based on confidence sets. It constructs a confidence set for the unknown parameter c, the true class of each future object, and classifies the object as belonging to the set of classes given by the confidence set. Hence, this approach classifies a future object into a single class only when there is enough evidence to warrant this, and into several classes otherwise. By allowing classification of an object into potentially more than one class, this classifier guarantees a pre-specified proportion of correct classification among all future objects, with a pre-specified confidence about the randomness in the training data based on which the classifier is constructed.
However, the classifier of Liu et al. (2019) uses a conservative critical constant and so the resultant confidence sets may be larger than necessary. The purpose of this paper is to determine the exact critical constant and therefore to improve the classifier of Liu et al. (2019) in situations where one has prior knowledge about the proportions of the (infinite) future objects belonging to the k possible classes.
The layout of the paper is as follows. Section 2 gives a very brief review of the classifier of Liu et al. (2019), and then considers the determination of the exact critical constant under the additional knowledge/assumption given above. An illustrative example is given in Section 3 to demonstrate the advantage of the improved classifier proposed in this paper when the additional assumption holds. Section 4 contains conclusions and discussions. Finally, some mathematical details are provided in the Appendix A. As the same setting and notation as in the work by Liu et al. (2019) are used, it is recommended to read this paper in conjunction with the one by Liu et al. (2019).
2. Methodology
2.1. The Classifier of Liu et al. (2019) and Its Exact Critical Constant
Let the $p$-dimensional data vector $Y_l$ denote the feature measurement on an object from the $l$th class, which has the multivariate normal distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$; here, $k$ denotes the total number of classes, which is a known number. The available training dataset is given by $T = \{ X_{l,j} : j = 1, \ldots, n_l,\ l = 1, \ldots, k \}$, where $X_{l,1}, \ldots, X_{l,n_l}$ are i.i.d. observations from the $l$th class with distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The classification problem is to make inference about $c$, the true class of a future object, based on the feature measurement $Y$ observed on the object, which is only known to belong to one of the $k$ classes and so follows one of the $k$ multivariate normal distributions. In statistical terminology, $c$ is the unknown parameter of interest that takes a possible value in the simple parameter space $C = \{1, \ldots, k\}$. We emphasize that $c$ is treated as non-random in both the work of Liu et al. (2019) and here.
A classifier that classifies an object with measurement $Y$ into one single class in $C$ can be regarded as a point estimator of $c$. The classifier of Liu et al. (2019) provides a set $C(Y) \subseteq C$ as plausible values of $c$. Depending on $Y$ and the training dataset $T$, $C(Y)$ may contain only a single value, in which case $Y$ is classified into the single class given by $C(Y)$. When $C(Y)$ contains more than one value in $C$, $Y$ is classified as possibly belonging to the several classes given by $C(Y)$. Hence, in statistical terms, the classifier uses the confidence set approach. The inherent advantage of the confidence set approach over the point estimation approach is the guaranteed proportion of confidence sets that contain the true classes.
Specifically, the set $C(Y)$ was constructed by Liu et al. (2019) as

$$C(Y) = \{\, l \in C : (Y - \bar{X}_l)^{\top} S_l^{-1} (Y - \bar{X}_l) \le \lambda \,\}, \quad (1)$$

where $\bar{X}_l$ and $S_l$, $l = 1, \ldots, k$, are, respectively, the usual estimators of the unknown $\mu_l$ and $\Sigma_l$ based on the training dataset $T$, and $\lambda$ is a suitably chosen critical constant whose determination is considered next. The intuition behind the definition of $C(Y)$ in Equation (1) is that a future object $Y$ is likely to be from class $l$ if and only if $(Y - \bar{X}_l)^{\top} S_l^{-1} (Y - \bar{X}_l) \le \lambda$.
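The construction in Equation (1) is a Mahalanobis-type quadratic-form check against each class. As an illustration only (not the authors' R program), the following Python sketch performs this membership check; the function name, the made-up class means and the identity inverse-covariance estimates are all our own assumptions:

```python
import numpy as np

def confidence_set(y, xbars, S_inv, lam):
    """Return the classes l whose acceptance region contains y, i.e. those
    with (y - xbar_l)' S_l^{-1} (y - xbar_l) <= lam, as in Equation (1)."""
    out = []
    for l, (xbar, Si) in enumerate(zip(xbars, S_inv), start=1):
        d = y - xbar
        if d @ Si @ d <= lam:
            out.append(l)
    return out

# Toy illustration with two made-up classes and identity covariances
xbars = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
S_inv = [np.eye(2), np.eye(2)]
print(confidence_set(np.array([0.2, -0.1]), xbars, S_inv, lam=6.0))  # [1]
print(confidence_set(np.array([2.0, 2.0]), xbars, S_inv, lam=9.0))   # [1, 2]
```

A point near one class mean is classified into that single class, while a point in the overlap of two acceptance regions is classified into both classes.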
Note that the proportion of the future confidence sets $C(Y_1), C(Y_2), \ldots$ that include the true classes $c_1, c_2, \ldots$ of $Y_1, Y_2, \ldots$ is given by $\liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \}$. Thus, it is desirable that

$$\liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma, \quad (2)$$

where $1 - \gamma$ is a pre-specified large (close to 1) proportion, e.g., 0.95. While the constraint in Equation (2) is difficult to deal with directly, Liu et al. (2019) showed that a sufficient condition for guaranteeing Equation (2) is

$$\min_{1 \le l \le k} P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \ge 1 - \gamma, \quad (3)$$

where $P_{Y_l}\{\cdot \mid T\}$ denotes the conditional probability (the conditional expectation of the indicator) with respect to the random variable $Y_l \sim N_p(\mu_l, \Sigma_l)$ conditioning on the training dataset $T$ (or, equivalently, the $\bar{X}_l$'s and $S_l$'s).
Since the value of the expression on the left hand side of the inequality in Equation (3) (and in Equation (2) as well) depends on $T$ and is random, the inequality in Equation (3) cannot be guaranteed for each observed $T$. We therefore guarantee Equation (3) with a large (close to 1) probability $1 - \alpha$ with respect to the randomness in $T$:

$$P_T\Big\{ \min_{1 \le l \le k} P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (4)$$

which in turn guarantees that

$$P_T\Big\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma \Big\} \ge 1 - \alpha. \quad (5)$$
Computer code in R was provided by Liu et al. (2019) to compute the $\lambda$ that solves Equation (4), which allows the confidence set $C(Y)$ in Equation (1) to be constructed for each future object.
The interpretation of Equations (5) and (6) below is that, based on one observed training dataset $T$, one constructs the confidence sets $C(Y_i)$ for the true classes $c_i$ of all future $Y_i$ ($i = 1, 2, \ldots$) and claims that at least a $1 - \gamma$ proportion of these confidence sets do contain the true $c_i$'s. Then, we are $100(1 - \alpha)\%$ confident, with respect to the randomness in the training dataset $T$, that the claim is correct.
A natural question is how to find the exact critical constant $\lambda_e$ that solves the equation

$$P_T\Big\{ \liminf_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (6)$$

which is an improvement on the conservative $\lambda_c$ that solves Equation (4) as given by Liu et al. (2019). Next, we show how to find the exact critical constant $\lambda_e$ under an additional assumption which is satisfied in some applications.
Assume that, among the $N$ future objects that need to be classified, $N_l$ objects are actually from the $l$th class with the distribution $N_p(\mu_l, \Sigma_l)$, $l = 1, \ldots, k$. The additional assumption we make is that

$$\lim_{N \to \infty} \frac{N_l}{N} = p_l, \quad l = 1, \ldots, k, \quad (7)$$

where the $p_l$'s are assumed to be known constants in the interval $[0, 1]$. Intuitively, this assumption means that we know the proportions of the future objects that belong to each of the $k$ classes, even though we do not know the true class of each individual future object.
The assumption in Equation (7) is reasonable in some applications. For example, when screening for a particular disease among a specific population for preventive purposes, there are $k = 2$ classes: having the disease ($l = 1$) or not having the disease ($l = 2$). If we know the prevalence of the disease, $d$, in the overall population, then $p_1 = d$ and $p_2 = 1 - d$, even though we do not know whether an individual subject has the disease or not.
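As a tiny numerical illustration of this two-class case (the prevalence value below is hypothetical):

```python
# Hypothetical prevalence d of the disease in the population
d = 0.02

# Class proportions of Equation (7): class 1 = diseased, class 2 = healthy
p1, p2 = d, 1 - d

print(p1, p2)  # 0.02 0.98
```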
It is shown in Appendix A that, under the assumption in Equation (7), Equation (6) is equivalent to

$$P\Big\{ \sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda_e \,\big|\, \bar{Z}_l, V_l \big\} \ge 1 - \gamma \Big\} = 1 - \alpha, \quad (8)$$

where

$$Z_l \sim N_p(\mathbf{0}, I_p), \quad \bar{Z}_l \sim N_p(\mathbf{0}, I_p / n_l), \quad (n_l - 1) V_l \sim \text{Wishart}_p(n_l - 1, I_p), \quad l = 1, \ldots, k, \quad (9)$$

and all the $Z_l$'s, $\bar{Z}_l$'s and $V_l$'s are independent, $P\{\cdot \mid \bar{Z}_l, V_l\}$ denotes the conditional probability about $Z_l$ conditioning on $(\bar{Z}_l, V_l)$, and the outer $P$ denotes the probability about the $(\bar{Z}_l, V_l)$'s.
2.2. Algorithm for Computing the Exact Critical Constant $\lambda_e$
We now consider how to compute the critical constant $\lambda_e$ that solves Equation (8). Similar to Liu et al. (2019), this is accomplished by simulation in the following way. From the distributions given in Equation (9), in the $s$th repeat of the simulation, $s = 1, \ldots, S$, generate independent

$$\bar{Z}_l^{s} \sim N_p(\mathbf{0}, I_p / n_l) \quad \text{and} \quad (n_l - 1) V_l^{s} \sim \text{Wishart}_p(n_l - 1, I_p), \quad l = 1, \ldots, k,$$

and find the $\lambda^{s}$ so that

$$\sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l^{s})^{\top} (V_l^{s})^{-1} (Z_l - \bar{Z}_l^{s}) \le \lambda^{s} \,\big|\, \bar{Z}_l^{s}, V_l^{s} \big\} = 1 - \gamma. \quad (10)$$

Repeat this $S$ times to get $\lambda^{1}, \ldots, \lambda^{S}$ and order these as $\lambda^{(1)} \le \cdots \le \lambda^{(S)}$. It is well known (cf. [7]) that $\lambda^{(\lceil (1-\alpha) S \rceil)}$ converges to the required critical constant with probability one as $S \to \infty$. Hence, $\lambda^{(\lceil (1-\alpha) S \rceil)}$ is used as the required critical constant $\lambda_e$ for a large $S$ value, e.g., $S = 10{,}000$.
To find the $\lambda^{s}$ in Equation (10) for each $s$, we use simulation in the following way. Generate $Q$ independent random vectors $Z^{1}, \ldots, Z^{Q}$ from $N_p(\mathbf{0}, I_p)$, where $Q$ is the number of simulations for finding $\lambda^{s}$. For each given value of $\lambda^{s}$, the expression on the left side of Equation (10) can be computed by approximating each of the $k$ probabilities involved by the corresponding proportions out of the $Q$ simulations. It is also clear that this expression is monotone increasing in $\lambda^{s}$. Hence, the $\lambda^{s}$ that solves Equation (10) can be found by using a search algorithm; for example, the bisection method is used in our R code. To approximate the probabilities by the proportions reasonably accurately, a large $Q$ value, e.g., $Q = 10{,}000$, should be used.
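The two simulation layers just described can be sketched as follows in Python (the authors' implementation is in R; this re-implementation is our own, uses small S and Q for speed rather than the recommended 10,000, and all function and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def exact_lambda(n, p, props, gamma=0.05, alpha=0.05, S=300, Q=1000):
    """Sketch of the simulation algorithm for the exact critical constant.

    n: training sample sizes n_1, ..., n_k; props: the known proportions
    p_1, ..., p_k of Equation (7). S and Q are deliberately small here."""
    k = len(n)
    Z = rng.standard_normal((Q, p))              # inner sample Z ~ N_p(0, I)
    lams = np.empty(S)
    for s in range(S):
        # pivotal quantities of Equation (9), one set per class
        qf = []
        for l in range(k):
            zbar = rng.standard_normal(p) / np.sqrt(n[l])  # Zbar_l ~ N_p(0, I/n_l)
            G = rng.standard_normal((n[l] - 1, p))
            V = G.T @ G / (n[l] - 1)             # (n_l - 1) V_l ~ Wishart_p(n_l - 1, I)
            D = Z - zbar
            qf.append(np.einsum('ij,ij->i', D @ np.linalg.inv(V), D))

        def lhs(lam):                            # left side of Equation (10)
            return sum(props[l] * np.mean(qf[l] <= lam) for l in range(k))

        lo, hi = 0.0, 1.0                        # bisection: lhs is increasing in lam
        while lhs(hi) < 1 - gamma:
            hi *= 2.0
        for _ in range(40):
            mid = (lo + hi) / 2.0
            if lhs(mid) < 1 - gamma:
                lo = mid
            else:
                hi = mid
        lams[s] = hi
    lams.sort()
    return lams[int(np.ceil((1 - alpha) * S)) - 1]  # empirical (1 - alpha) quantile

lam_e = exact_lambda(n=[50, 50, 50], p=2, props=[1 / 3, 1 / 3, 1 / 3])
```

Each outer repeat draws the training-data pivots, each inner Monte Carlo approximates the $k$ conditional probabilities by proportions, and the final answer is the order statistic $\lambda^{(\lceil (1-\alpha) S \rceil)}$.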
It is noteworthy from Equations (8) and (9) that $\lambda_e$ depends only on $k$, $p$, $n_1, \ldots, n_k$, $p_1, \ldots, p_k$, $\gamma$ and $\alpha$ (and the numbers of simulations $S$ and $Q$, which determine the numerical accuracy of $\lambda_e$ due to simulation randomness). It is also worth emphasizing that only one $\lambda_e$ needs to be computed; based on the observed training dataset $T$, it is then used for constructing the confidence sets and classifying accordingly all future objects.
It is expected that larger values of $S$ and $Q$ will produce a more accurate value of $\lambda_e$; one can use the method discussed by Liu et al. (2019) to assess how the accuracy of $\lambda_e$ depends on the values of $S$ and $Q$. Similar to the work by Liu et al. (2019), it is recommended to set $S = 10{,}000$ and $Q = 10{,}000$ for reasonable computation time and accuracy of $\lambda_e$ due to simulation randomness.
3. An Illustrative Example
As in the work of Liu et al. (2019), the famous iris dataset introduced by Fisher (1936) [8] is used in this section to illustrate the method proposed in this paper. The dataset contains $k = 3$ classes representing the three species/classes of Iris flowers (1 = setosa; 2 = versicolor; and 3 = virginica), and has 50 observations from each class in $p = 4$ dimensions. Each observation gives the measurements (in centimeters) of the four variables: sepal length and width, and petal length and width.
We focus on the case where only the first two measurements, sepal length and width, are used for classification ($p = 2$), in order to illustrate the method easily, since the acceptance sets are then two-dimensional and thus can be easily plotted. Based on the fifty observations on these two measurements from each of the three classes, the estimates $\bar{X}_l$ and $S_l$, $l = 1, 2, 3$, were given by Liu et al. (2019).
For the given $\alpha$ and $\gamma$, the conservative critical constant $\lambda_c$ that solves Equation (4) was computed by Liu et al. (2019) using $S = 10{,}000$ and $Q = 10{,}000$. The corresponding acceptance sets

$$A_l = \{\, y : (y - \bar{X}_l)^{\top} S_l^{-1} (y - \bar{X}_l) \le \lambda_c \,\}, \quad l = 1, 2, 3,$$

based on which the confidence set $C(Y)$ in Equation (1) can be constructed directly (cf. [6]), are plotted in Figure 1 by the dotted ellipsoidal regions centered at the $\bar{X}_l$'s, marked by "+".
Figure 1.
The exact (solid) and conservative (dotted) acceptance sets for the three classes.
Now, assume that we have knowledge of the proportions of the three species among all the Iris flowers, and that the Iris flowers that need to be classified reflect this composition. For the same $\alpha$, $\gamma$, $S = 10{,}000$ and $Q = 10{,}000$, and with, for example, a given set of proportions $(p_1, p_2, p_3)$, the exact critical constant $\lambda_e$ that solves Equation (6) is computed by our R program. As expected, $\lambda_e$ is smaller than $\lambda_c$ and, as a result, the corresponding confidence set $C(Y)$ in Equation (1) with $\lambda = \lambda_e$ and its acceptance sets $A_l$, $l = 1, 2, 3$, are also smaller than those given by Liu et al. (2019).
The exact acceptance sets $A_l$ are plotted in Figure 1 by the solid ellipsoidal regions. For example, if a future object has the measurement $Y$ marked by the solid dot in Figure 1, then the conservative confidence set of Liu et al. (2019) classifies the object as from Classes 2 and 3, since this $Y$ belongs to both of the conservative acceptance sets $A_2$ and $A_3$. However, the new exact confidence set of this paper classifies the object as from Class 2 only, since this $Y$ belongs to the exact $A_2$ but not $A_1$ or $A_3$. This demonstrates the advantage of the new confidence set using $\lambda_e$ in this paper over the conservative confidence set using $\lambda_c$ by Liu et al. (2019). We have also computed the value of $\lambda_e$ for several other given sets of proportions $(p_1, p_2, p_3)$. The conservative $\lambda_c$ is considerably larger than these values, by between 14% and 19%.
One can download from http://www.personal.soton.ac.uk/wl/Classification/ the R computer program ExactConfidenceSetClassifier.R that implements this simulation method of computing the critical constant $\lambda_e$. The computation of one $\lambda_e$ using $(S, Q) = (10{,}000, 10{,}000)$ takes about 13 h on an ordinary Windows PC (Core(TM)2 Duo CPU P8400 @ 2.26 GHz).
4. Conclusions
The probability statement in Equation (5) means that the confidence sets of Liu et al. (2019) have the nice interpretation that, with confidence level $1 - \alpha$ about the randomness in the training dataset $T$, at least a $1 - \gamma$ proportion of the confidence sets $C(Y_i)$, $i = 1, 2, \ldots$, contain the true classes $c_i$, $i = 1, 2, \ldots$, of the future objects $Y_i$, $i = 1, 2, \ldots$. However, the confidence set given by Liu et al. (2019) is conservative in that the $\lambda$ in the confidence set in Equation (1) is computed to solve Equation (4), which only implies the constraint in Equation (5). This paper considers how to compute the $\lambda$ in the confidence set in Equation (1) so that the probability in Equation (5) is equal to $1 - \alpha$, i.e., to solve Equation (6). The confidence sets using the $\lambda_e$ that solves Equation (6) have a confidence level equal to $1 - \alpha$ and so are exact. We show that this can be accomplished under the extra assumption given in Equation (7), which may be sensible in some applications.
As the $\lambda_e$ that solves Equation (6) is smaller than the $\lambda_c$ that solves Equation (4) used by Liu et al. (2019), the new confidence sets are smaller and so better than the confidence sets given by Liu et al. (2019).
One wonders whether there are other sensible assumptions that allow the $\lambda_e$ to be solved from Equation (6). This warrants further research.
If $C(Y)$ for a future object $Y$ is empty then, since $Y$ must be from one of the $k$ classes, $C(Y)$ can be augmented to include the class that has the largest posterior probability under the naive Bayesian classifier, as in the work by Liu et al. (2019). The probability statement in Equation (5) clearly holds under this augmentation, since $C(Y)$ is enlarged only when it is empty.
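A sketch of this augmentation step, assuming plug-in normal densities weighted by the known proportions serve as the posterior weights (the exact Bayes rule used by Liu et al. (2019) may differ, and all names and numbers here are our own):

```python
import numpy as np

def augment_if_empty(y, cset, xbars, S_list, props):
    """If the confidence set is empty, return the single class with the
    largest plug-in posterior weight p_l * N(y; xbar_l, S_l); otherwise
    return the confidence set unchanged."""
    if cset:
        return cset
    def log_weight(l):
        d = y - xbars[l]
        sign, logdet = np.linalg.slogdet(S_list[l])
        return np.log(props[l]) - 0.5 * logdet - 0.5 * (d @ np.linalg.solve(S_list[l], d))
    return [max(range(len(xbars)), key=log_weight) + 1]

# Toy usage: an empty confidence set for a point near class 1's mean
xbars = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
S_list = [np.eye(2), np.eye(2)]
print(augment_if_empty(np.array([0.5, 0.0]), [], xbars, S_list, [0.5, 0.5]))   # [1]
print(augment_if_empty(np.array([0.5, 0.0]), [2], xbars, S_list, [0.5, 0.5]))  # [2]
```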
There are applications in which information about the proportions $p_l$ is known only with uncertainty. For example, the training dataset may be a representative sample from the population, so that the proportion of each class can be estimated from it, or the proportions may have been estimated from a previous independent auxiliary dataset. If one replaces the $p_l$'s in Equation (8) by these estimates, then the $\lambda_e$ solved from Equation (8) will depend on these estimates and so be random. As a result, the probability statement in Equation (5) is no longer valid. How to deal with these applications warrants further research.
Finally, the classifier of Liu et al. (2019) is developed from the idea of Lieberman et al. [9,10]. The same idea was also used by, for example, Mee et al. (1991) [11], Han et al. (2016) [12], Liu et al. (2016) [13] and Peng et al. (2019) [14], who all used conservative critical constants as did Liu et al. (2019). The idea of this paper can be applied to all these works to compute exact critical constants under suitable extra assumptions.
Author Contributions
Methodology, F.B., A.J.H., W.L.; software, W.L.
Acknowledgments
We would like to thank the referees for critical and constructive comments on the earlier version of the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Mathematical Details
In this appendix, we show the equivalence of Equations (6) and (8) under the assumption in Equation (7). Note first the well-known fact (cf. [15]) that $\bar{X}_l \sim N_p(\mu_l, \Sigma_l / n_l)$ and $(n_l - 1) S_l \sim \text{Wishart}_p(n_l - 1, \Sigma_l)$ are independent, with the future feature measurements being i.i.d. random vectors independent of the training dataset $T$.
Among the $N$ future objects that need to be classified, let $N_l$ be the number of objects actually from the $l$th class, with their feature measurements denoted as $Y_{l,1}, \ldots, Y_{l,N_l}$, $l = 1, \ldots, k$. Clearly, we have $N = N_1 + \cdots + N_k$ and

$$\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} = \sum_{l=1}^{k} \frac{N_l}{N} \cdot \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbf{1}\{\, l \in C(Y_{l,i}) \,\}. \quad (A1)$$
We have from the classical strong law of large numbers (cf. [16]) that

$$\lim_{N_l \to \infty} \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbf{1}\{\, l \in C(Y_{l,i}) \,\} = P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} \quad \text{almost surely}, \quad (A2)$$

in which the conditional probability is used since all the confidence sets $C(Y_{l,i})$ ($i = 1, \ldots, N_l$) use the same training dataset $T$. By noting that $Y_{l,1}, \ldots, Y_{l,N_l}$ are from the $l$th class and thus have the same distribution $N_p(\mu_l, \Sigma_l)$, and writing $Y_l = \mu_l + \Sigma_l^{1/2} Z_l$, $\bar{X}_l = \mu_l + \Sigma_l^{1/2} \bar{Z}_l$ and $S_l = \Sigma_l^{1/2} V_l \Sigma_l^{1/2}$, so that the factor $\Sigma_l^{1/2}$ cancels in the quadratic form, we have from the definition of $C(Y)$ in Equation (1) that

$$P_{Y_l}\{\, l \in C(Y_l) \mid T \,\} = P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda \,\big|\, \bar{Z}_l, V_l \big\}, \quad (A3)$$

where

$$Z_l \sim N_p(\mathbf{0}, I_p), \quad \bar{Z}_l \sim N_p(\mathbf{0}, I_p / n_l), \quad (n_l - 1) V_l \sim \text{Wishart}_p(n_l - 1, I_p),$$

with all the $Z_l$'s, $\bar{Z}_l$'s and $V_l$'s being independent. Note that $Z_l$ depends on the future observation but not the training dataset $T$, while $\bar{Z}_l$ and $V_l$ depend on the training dataset $T$ but not the future observations.
Combining the assumption in Equation (7) with Equations (A1)–(A3) gives

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ c_i \in C(Y_i) \} = \sum_{l=1}^{k} p_l \, P\big\{ (Z_l - \bar{Z}_l)^{\top} V_l^{-1} (Z_l - \bar{Z}_l) \le \lambda \,\big|\, \bar{Z}_l, V_l \big\} \quad \text{almost surely},$$

from which the equivalence of Equations (6) and (8) follows immediately.
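The invariance used in Equation (A3), namely that the conditional coverage probability does not depend on the unknown $(\mu_l, \Sigma_l)$, can be checked numerically. The following Python sketch is our own (all numeric values are arbitrary) and compares the Monte Carlo coverage under a generic $(\mu, \Sigma)$ with the standardized case $\mu = \mathbf{0}$, $\Sigma = I$ of Equation (9):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, lam, reps = 2, 50, 8.0, 10000

def coverage(mu, Sigma):
    """Monte Carlo estimate of P{(Y - Xbar)' S^{-1} (Y - Xbar) <= lam},
    averaging over both the training sample and the future observation Y."""
    A = np.linalg.cholesky(Sigma)
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p)) @ A.T + mu  # training sample ~ N_p(mu, Sigma)
        Y = A @ rng.standard_normal(p) + mu         # future observation ~ N_p(mu, Sigma)
        d = Y - X.mean(axis=0)
        S = np.cov(X, rowvar=False)                 # usual estimator of Sigma
        hits += d @ np.linalg.solve(S, d) <= lam
    return hits / reps

c_general = coverage(np.array([3.0, -1.0]), np.array([[2.0, 0.6], [0.6, 1.0]]))
c_standard = coverage(np.zeros(p), np.eye(p))       # the pivotal case of Equation (9)
```

The two estimates agree to within Monte Carlo error, illustrating that $\lambda_e$ can be computed without knowing the $\mu_l$'s and $\Sigma_l$'s.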
References
- Flach, P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data; Cambridge University Press: Cambridge, UK, 2012.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2017.
- Piegorsch, W.W. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery; Wiley: Hoboken, NJ, USA, 2015.
- Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 4th ed.; Academic Press: Cambridge, MA, USA, 2009.
- Webb, A.R.; Copsey, K.D. Statistical Pattern Recognition, 3rd ed.; Wiley: Hoboken, NJ, USA, 2011.
- Liu, W.; Bretz, F.; Srimaneekarn, N.; Peng, J.; Hayter, A.J. Confidence sets for statistical classification. Stats 2019, 2, 332–346.
- Serfling, R. Approximation Theorems of Mathematical Statistics; Wiley: Hoboken, NJ, USA, 1980.
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
- Lieberman, G.J.; Miller, R.G., Jr. Simultaneous tolerance intervals in regression. Biometrika 1963, 50, 155–168.
- Lieberman, G.J.; Miller, R.G., Jr.; Hamilton, M.A. Simultaneous discrimination intervals in regression. Biometrika 1967, 54, 133–145; Correction in 1971, 58, 687.
- Mee, R.W.; Eberhardt, K.R.; Reeve, C.P. Calibration and simultaneous tolerance intervals for regression. Technometrics 1991, 33, 211–219.
- Han, Y.; Liu, W.; Bretz, F.; Wan, F.; Yang, P. Statistical calibration and exact one-sided simultaneous tolerance intervals for polynomial regression. J. Stat. Plan. Inference 2016, 168, 90–96.
- Liu, W.; Han, Y.; Bretz, F.; Wan, F.; Yang, P. Counting by weighing: Know your numbers with confidence. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2016, 65, 641–648.
- Peng, J.; Liu, W.; Bretz, F.; Hayter, A.J. Counting by weighing: Two-sided confidence intervals. J. Appl. Stat. 2019, 46, 262–271.
- Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003.
- Chow, Y.S.; Teicher, H. Probability Theory: Independence, Interchangeability, Martingales; Springer: New York, NY, USA, 1978.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
