Article

Exact Fit of Simple Finite Mixture Models †

Dirk Tasche 1,2
1 Prudential Regulation Authority, Bank of England, 20 Moorgate, London EC2R 6DA, UK
2 Department of Mathematics, Imperial College London, London SW7 2AZ, UK
† This paper is an extended version of our paper published in the Fifth International Conference on Mathematics in Finance (MiF) 2014, organized by North-West University, University of Cape Town and University of Johannesburg, 24–29 September 2014, Skukuza, Kruger National Park, South Africa.
The opinions expressed in this note are those of the author and do not necessarily reflect views of the Bank of England.
J. Risk Financial Manag. 2014, 7(4), 150-164; https://doi.org/10.3390/jrfm7040150
Submission received: 6 September 2014 / Revised: 7 October 2014 / Accepted: 4 November 2014 / Published: 20 November 2014

Abstract:
How to forecast next year’s portfolio-wide credit default rate based on last year’s default observations and the current score distribution? A classical approach to this problem consists of fitting a mixture of the conditional score distributions observed last year to the current score distribution. This is a special (simple) case of a finite mixture model where the mixture components are fixed and only the weights of the components are estimated. The optimum weights provide a forecast of next year’s portfolio-wide default rate. We point out that the maximum-likelihood (ML) approach to fitting the mixture distribution not only gives an optimum but even an exact fit if we allow the mixture components to vary but keep their density ratio fixed. From this observation we can conclude that the standard default rate forecast based on last year’s conditional default rates will always be located between last year’s portfolio-wide default rate and the ML forecast for next year. As an application example, cost quantification is then discussed. We also discuss how the mixture model based estimation methods can be used to forecast total loss. This involves the reinterpretation of an individual classification problem as a collective quantification problem.
MSC classifications:
62P30; 62F10

1. Introduction

The study of finite mixture models was initiated in the 1890s by Karl Pearson when he wanted to model multimodal densities. Research on finite mixture models has continued ever since, but its focus has changed over time as further areas of application were identified and available computational power increased. More recently, the natural connection between finite mixture models and classification methods, with their applications in fields like machine learning or credit scoring, has begun to be investigated in more detail. In these applications, it can often be assumed that the mixture models are simple in the sense that the component densities are known (i.e., there is no dependence on unknown parameters) but their weights are unknown.
In this note, we explore a specific property of simple finite mixture models, namely that their maximum likelihood (ML) estimates provide an exact fit of the observed densities whenever the estimates exist. Conceptually, exact fit is of interest because it inspires trust that the estimates of the component weights are also reliable. A practical consequence of exact fit is an inequality relating the finite mixture model estimate of the class probabilities in a binary classification problem to the so-called covariate shift estimate (see Corollary 7 below for details). These observations extend the observations made for the case “no independent estimate of the unconditional default probability given” in [1] to the multi-class case and general probability spaces.
In Section 2, we present the result on the exact fit property in a general simple finite mixture model context. In Section 3, we discuss the consequences of this result for classification and quantification problems and compare the ML estimator with other estimators that were proposed in the literature. In Section 4, we revisit the cost quantification problem as introduced in [2] as an application. In Section 5, we illustrate by a stylised example from mortgage risk management how the estimators discussed before can be deployed for the forecast of expected loss rates. Section 6 concludes the note while an appendix presents the proofs of the more involved mathematical results.

2. The Exact Fit Property

We discuss the properties of the ML estimator of the weights in a simple finite mixture model in a general measure-theoretic setting because in this way it is easier to identify the important aspects of the problem. The setting may formally be described as follows:
Assumption 1. $\mu$ is a measure on $(\Omega, \mathcal{H})$, and $g > 0$ is a probability density with respect to $\mu$. Suppose that the probability measure $P$ is given by $P[H] = \int_H g \, d\mu$ for $H \in \mathcal{H}$. Write $E$ for the expectation with regard to $P$.
In applications, the measure μ will often be a multi-dimensional Lebesgue measure or a counting measure. The measure P might not be known exactly but might have to be approximated by the empirical measure associated with a sample of observations. This would give an ML estimation context.
The problem we study can be described as follows:
Problem 2. Approximate $g$ by a mixture of probability $\mu$-densities $f_1, \ldots, f_k$, i.e., $g \approx \sum_{i=1}^k p_i f_i$ for suitable $p_1, \ldots, p_k \ge 0$ with $\sum_{i=1}^k p_i = 1$.
In the literature, most of the time a sample version of the problem (i.e., with $P$ as an empirical measure) is discussed. Often the component densities $f_i$ depend on parameters that have to be estimated in addition to the weights $p_i$ (see [3] or [4], and for a more recent survey see [5]). In this note, we consider the simple case where the component densities $f_i$ are assumed to be known and fixed. This is a standard assumption for classification (see [6]) and quantification (see [2]) problems.
Common approaches to the approximation problem 2 are
  • Least Squares. Determine $\min_{(p_1, \ldots, p_k)} \int \bigl(g - \sum_{i=1}^k p_i f_i\bigr)^2 d\mu$ or its weighted version $\min_{(p_1, \ldots, p_k)} \int g \bigl(g - \sum_{i=1}^k p_i f_i\bigr)^2 d\mu$ (see [7] and [8] for recent applications to credit risk and text categorisation, respectively). The main advantage of the least squares approach compared with other approaches is that closed-form solutions are available. However, the achieved minimum will not be zero (i.e., there is no exact fit) unless the density $g$ happens to be a mixture of the densities $f_i$.
  • Kullback–Leibler (KL) distance. Determine
$$\min_{(p_1, \ldots, p_k)} \int g \,\log\frac{g}{\sum_{i=1}^k p_i f_i}\, d\mu \;=\; \min_{(p_1, \ldots, p_k)} E\!\left[\log\frac{g}{\sum_{i=1}^k p_i f_i}\right] \qquad (1)$$
    (see [9] for a recent discussion). We show below (see Remark 1) that a minimum of zero (i.e., exact fit) can be achieved if we allow the $f_i$ to be replaced by densities $g_i$ with the same density ratios as the $f_i$. In the case where $P$ from Assumption 1 is the empirical measure of a sample, the r.h.s. of (1) becomes (up to an additive constant) a conventional log-likelihood function and thus presents an ML problem.
In the following we examine in detail the conditions under which exact fit by the KL approximation can occur. First we note two alternative representations of the KL distance (assuming all integrals are well-defined and finite):
$$\int g \,\log\frac{g}{\sum_{i=1}^k p_i f_i}\, d\mu = \int g \,\log(g)\, d\mu - \int g \,\log\Bigl(\sum_{i=1}^k p_i f_i\Bigr) d\mu \qquad (2)$$
$$= \int g \,\log(g)\, d\mu - \int g \,\log(f_k)\, d\mu - \int g \,\log\Bigl(1 + \sum_{i=1}^{k-1} p_i\,(R_i - 1)\Bigr) d\mu \qquad (3)$$
with $R_i = \frac{f_i}{f_k}$, $i = 1, \ldots, k-1$, if $f_k > 0$.
The problem in (2) of maximising $\int g \,\log\bigl(\sum_{i=1}^k p_i f_i\bigr) d\mu$ was studied in [10] (with $g$ taken as the density of the empirical measure, which gives ML estimators). The authors suggested the Expectation-Maximisation (EM) algorithm, involving conditional class probabilities, for determining the maximum. This works well in general but sometimes suffers from slow convergence.
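For concreteness, the following Python sketch shows the fixed-point (EM) update for the component weights in the sample (ML) setting with fixed, known component densities. It is an illustration under the assumptions of this note rather than code from [10] or [11]; the function and argument names (`em_mixture_weights`, `component_densities`) are illustrative.

```python
import numpy as np

def em_mixture_weights(component_densities, n_iter=1000, tol=1e-10):
    """EM/fixed-point iteration for the weights of a simple finite mixture.

    component_densities: array of shape (N, k) holding f_i(x_n), the values
    of the k fixed component densities at the N sample points.
    Returns the ML estimate of the weight vector (p_1, ..., p_k).
    """
    f = np.asarray(component_densities, dtype=float)
    _, k = f.shape
    p = np.full(k, 1.0 / k)            # start from uniform weights
    for _ in range(n_iter):
        mix = f @ p                    # mixture density at each sample point
        resp = f * p / mix[:, None]    # conditional class probabilities given the features
        p_new = resp.mean(axis=0)      # EM update of the weights
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

def log_likelihood(component_densities, p):
    """Sample version of the objective maximised in (2): sum_n log(sum_i p_i f_i(x_n))."""
    return float(np.sum(np.log(np.asarray(component_densities, dtype=float) @ p)))
```

The slow convergence mentioned above typically shows up as many iterations in which the smallest weights change only marginally.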
The ML version of (2) had been studied before in [11]. There, the authors analysed the same iteration procedure, which they stated, however, in terms of densities instead of conditional probabilities. In [11], the iteration was derived differently from [10], namely by studying the gradient of the likelihood function. We revisit the approach from [11] from a different angle, by starting from (3).
There is, however, the complication that $g \,\log\bigl(1 + \sum_{i=1}^{k-1} p_i\,(R_i - 1)\bigr)$ is not necessarily integrable. This complication does not affect the gradient with respect to $(p_1, \ldots, p_{k-1})$, though, so we focus on investigating the gradient. With $R_i$ as in (3), let
$$F(p_1, \ldots, p_{k-1}) \stackrel{\mathrm{def}}{=} \int g \,\log\Bigl(1 + \sum_{i=1}^{k-1} p_i\,(R_i - 1)\Bigr) d\mu, \qquad (p_1, \ldots, p_{k-1}) \in S_{k-1} \stackrel{\mathrm{def}}{=} \Bigl\{(s_1, \ldots, s_{k-1}) : s_1 > 0, \ldots, s_{k-1} > 0,\ \textstyle\sum_{i=1}^{k-1} s_i < 1\Bigr\} \qquad (4)$$
From this we obtain for the gradient of $F$
$$G_j(p_1, \ldots, p_{k-1}) \stackrel{\mathrm{def}}{=} \frac{\partial F}{\partial p_j}(p_1, \ldots, p_{k-1}) = \int \frac{g\,(R_j - 1)}{1 + \sum_{i=1}^{k-1} p_i\,(R_i - 1)}\, d\mu, \qquad j = 1, \ldots, k-1 \qquad (5)$$
$G_j(p_1, \ldots, p_{k-1})$ is well-defined and finite for $(p_1, \ldots, p_{k-1}) \in S_{k-1}$:
$$|G_j(p_1, \ldots, p_{k-1})| \le \int \frac{g\,(R_j + 1)}{1 - \sum_{i=1}^{k-1} p_i + \sum_{i=1}^{k-1} p_i R_i}\, d\mu \le \frac{1}{p_j} + \frac{1}{1 - \sum_{i=1}^{k-1} p_i} < \infty \qquad (6)$$
We are now in a position to state the main result of this note.
Theorem 3. Let $g$ and $\mu$ be as in Assumption 1. Assume that $R_1, \ldots, R_{k-1} \ge 0$ are $\mathcal{H}$-measurable functions on $\Omega$. Suppose there is a vector $(p_1, \ldots, p_{k-1}) \in S_{k-1}$ such that
$$0 = G_i(p_1, \ldots, p_{k-1}), \qquad i = 1, \ldots, k-1 \qquad (7)$$
for $G_i$ as defined in (5). Define $p_k = 1 - \sum_{j=1}^{k-1} p_j$. Then the following two statements hold:
(a) 
$g_i = \frac{g\, R_i}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}$, $i = 1, \ldots, k-1$, and $g_k = \frac{g}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}$ are probability densities with respect to $\mu$ such that $g = \sum_{i=1}^k p_i g_i$ and $R_i = \frac{g_i}{g_k}$, $i = 1, \ldots, k-1$.
(b) 
Let $R_1 - 1$, …, $R_{k-1} - 1$ additionally be linearly independent, i.e.,
$$P\Bigl[\textstyle\sum_{i=1}^{k-1} a_i\,(R_i - 1) = 0\Bigr] = 1 \quad \text{implies} \quad 0 = a_1 = \cdots = a_{k-1}$$
Assume that $h_1, \ldots, h_{k-1} \ge 0$, $h_k > 0$ are $\mu$-densities of probability measures on $(\Omega, \mathcal{H})$ and that $(q_1, \ldots, q_{k-1}) \in S_{k-1}$ with $q_k = 1 - \sum_{i=1}^{k-1} q_i$ is such that $g = \sum_{i=1}^k q_i h_i$ and $R_i = \frac{h_i}{h_k}$, $i = 1, \ldots, k-1$. Then it follows that $q_i = p_i$ and $h_i = g_i$ for $g_i$ as defined in (a), for $i = 1, \ldots, k$.
See appendix for a proof of Theorem 3.
Remark 1. If the KL distance on the l.h.s. of (2) is well-defined and finite for all $(p_1, \ldots, p_{k-1}) \in S_{k-1}$, then under the assumptions of Theorem 3 (b) there is a unique $(p_1^*, \ldots, p_{k-1}^*) \in S_{k-1}$ such that the KL distance of $g$ and $\sum_{i=1}^k p_i^* f_i$ is minimal. In addition, by Theorem 3 (a), there are densities $g_i$ with $\frac{g_i}{g_k} = \frac{f_i}{f_k}$ for $i = 1, \ldots, k-1$ such that the KL distance of $g$ and $\sum_{i=1}^k p_i^* g_i$ is zero—this is the exact fit property of simple finite mixture models alluded to in the title of this note. ☐
Remark 2.
(a) 
Theorem 3 (b) provides a simple condition for the uniqueness of the solution to (7). In the case $k = 2$ this condition simplifies to
$$P[R_1 = 1] < 1 \qquad (8a)$$
For $k > 2$ there is no similarly simple condition for the existence of a solution in $S_{k-1}$. However, as noted in [12] (Example 4.3.1), there is a simple necessary and sufficient condition for the existence of a solution in $S_1 = (0,1)$ to (7) in the case $k = 2$:
$$E[R_1] > 1 \quad \text{and} \quad E\bigl[R_1^{-1}\bigr] > 1 \qquad (8b)$$
(b) 
Suppose we are in the setting of Theorem 3 (a) with $k > 2$ and all $R_i > 0$. Hence there are $\mu$-densities $g_1, \ldots, g_k$, $(p_1, \ldots, p_{k-1}) \in S_{k-1}$, and $p_k = 1 - \sum_{i=1}^{k-1} p_i$ such that $g = \sum_{i=1}^k p_i g_i$ and $R_i = \frac{g_i}{g_k}$, $i = 1, \ldots, k$. Let $\bar{g} = \frac{\sum_{j=2}^k p_j g_j}{1 - p_1}$. Then we have another decomposition of $g$, namely $g = p_1 g_1 + (1 - p_1)\,\bar{g}$, with $p_1 \in (0,1)$. The proof of Theorem 3 (b) shows that, with $\bar{R} = \frac{g_1}{\bar{g}}$, this implies
$$0 = \int \frac{g\,(\bar{R} - 1)}{1 + p_1\,(\bar{R} - 1)}\, d\mu \qquad (9)$$
By (a) there is a solution $p_1 \in (0,1)$ to (9) if and only if $E[\bar{R}] > 1$ and $E\bigl[\bar{R}^{-1}\bigr] > 1$.
(c) 
As mentioned in (a), an interesting question for the application of Theorem 3 is how to find out whether or not there is a solution to (7) in $S_{k-1}$ for $k > 2$. The iteration suggested in [11] and [10] will correctly converge to a point on the boundary of $S_{k-1}$ if there is no solution in the interior (Theorem 2 of [11]). However, convergence may be so slow that it remains unclear whether a component of the limit is zero (and therefore the solution is on the boundary) or genuinely very small but positive. The straightforward Newton–Raphson approach to determining the maximum of $F$ defined by (4) may converge faster, but it may also become unstable for solutions close to or on the boundary of $S_{k-1}$.
However, when $k > 2$, the observation made in (b) suggests that the following Gauss–Seidel-type iteration works if the initial value $(p_1^{(0)}, \ldots, p_{k-1}^{(0)})$ with $p_k^{(0)} = 1 - \sum_{i=1}^{k-1} p_i^{(0)}$ is sufficiently close to the solution (if any) of (7); a code sketch follows after the list:
- Assume that for some $n \ge 0$ an approximate solution $(q_1, \ldots, q_k) = (p_1^{(n)}, \ldots, p_k^{(n)})$ has been found.
- For $i = 1, \ldots, k$ try successively to update $(q_1, \ldots, q_k)$ by solving (9) with component $i$ playing the role of component 1 in (b), with $p_1 = q_i$ as well as $\bar{g} = \frac{\sum_{j=1, j \ne i}^k q_j g_j}{1 - q_i}$. If for all $i = 1, \ldots, k$ the necessary and sufficient condition for the updated $q_i$ to be in $(0,1)$ is not satisfied, then stop—it is then likely that there is no solution to (7) in $S_{k-1}$. Otherwise update $q_i$ with the solution of (9) where possible, resulting in $q_{i,\mathrm{new}}$, and set
$$q_{j,\mathrm{new}} = \frac{1 - q_{i,\mathrm{new}}}{1 - q_i}\, q_j, \qquad j \ne i$$
- After step $k$ set $(p_1^{(n+1)}, \ldots, p_k^{(n+1)}) = (q_{1,\mathrm{new}}, \ldots, q_{k,\mathrm{new}})$ if the algorithm has not been stopped by violation of the resolvability condition for (7).
- Terminate the calculation when a suitable distance measure between successive $(p_1^{(n)}, \ldots, p_k^{(n)})$ is sufficiently small.  ☐
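A possible implementation of this Gauss–Seidel-type iteration in the sample (ML) setting is sketched below. The empirical measure plays the role of $P$, Equation (9) is solved by bisection (its left-hand side is decreasing in $p_1$), and condition (8b) is checked with empirical expectations. All names, tolerances and the stopping rule are illustrative assumptions, not part of the description above.

```python
import numpy as np

def _solve_binary(r_bar, tol=1e-12):
    """Solve 0 = mean((r_bar - 1) / (1 + p * (r_bar - 1))) for p in (0, 1).

    Returns None if the empirical version of condition (8b),
    mean(r_bar) > 1 and mean(1 / r_bar) > 1, fails (no solution in (0, 1)).
    The left-hand side is decreasing in p, so bisection suffices.
    """
    if not (r_bar.mean() > 1.0 and (1.0 / r_bar).mean() > 1.0):
        return None
    def h(p):
        return np.mean((r_bar - 1.0) / (1.0 + p * (r_bar - 1.0)))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gauss_seidel_weights(component_densities, n_sweeps=200, tol=1e-9):
    """Gauss-Seidel-type iteration of Remark 2 (c), sample (ML) setting.

    component_densities: array of shape (N, k) with the values f_i(x_n) of
    the fixed component densities at the sample points.
    """
    f = np.asarray(component_densities, dtype=float)
    _, k = f.shape
    q = np.full(k, 1.0 / k)                       # initial weights
    for _ in range(n_sweeps):
        q_old = q.copy()
        failures = 0
        for i in range(k):
            mix = f @ q                           # current mixture density
            # density of the aggregate of all components other than i
            f_bar = (mix - q[i] * f[:, i]) / (1.0 - q[i])
            p_i = _solve_binary(f[:, i] / f_bar)  # solve (9) for component i
            if p_i is None:
                failures += 1
                continue
            others = np.arange(k) != i
            q[others] *= (1.0 - p_i) / (1.0 - q[i])
            q[i] = p_i
        if failures == k:
            break   # condition (8b) failed for every component: likely no solution in S_{k-1}
        if np.max(np.abs(q - q_old)) < tol:
            break
    return q
```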

3. Application to Quantification Problems

Finite mixture models occur naturally in machine learning contexts. Specifically, in this note we consider the following context:
Assumption 4.
  • $(\Omega, \mathcal{H})$ is a measurable space. For some $k \ge 2$, $A_1, \ldots, A_k \subset \Omega$ is a partition of $\Omega$. $\mathcal{A}$ is the σ-field generated by $\mathcal{H}$ and the $A_i$, i.e.,
$$\mathcal{A} = \sigma\bigl(\mathcal{H} \cup \{A_1, \ldots, A_k\}\bigr) = \Bigl\{\textstyle\bigcup_{i=1}^k (A_i \cap H_i) : H_1, \ldots, H_k \in \mathcal{H}\Bigr\}$$
  • $P_0$ is a probability measure on $(\Omega, \mathcal{A})$ with $P_0[A_i] > 0$ for $i = 1, \ldots, k$. $P_1$ is a probability measure on $(\Omega, \mathcal{H})$. Write $E_i$ for the expectation with respect to $P_i$.
  • There is a measure $\mu$ on $(\Omega, \mathcal{H})$ and there are $\mu$-densities $f_1, \ldots, f_{k-1} \ge 0$, $f_k > 0$ such that
$$P_0[H \mid A_i] = \int_H f_i \, d\mu, \qquad i = 1, \ldots, k, \quad H \in \mathcal{H}$$
The space $(\Omega, \mathcal{A}, P_0)$ describes the training set of a classifier. On the training set, for each example both the features (expressed by $\mathcal{H}$) and the class (described by one of the $A_i$) are known. Note that $f_k > 0$ implies $A_k \notin \mathcal{H}$.
$(\Omega, \mathcal{H}, P_1)$ describes the test set on which the classifier is deployed. On the test set only the features of the examples are known.
In mathematical terms, quantification might be described as the task of extending $P_1$ onto $\mathcal{A}$, based on properties observed from the training set, i.e., of $P_0$. Basically, this means estimating the prior class probabilities (or prevalences) $P_1[A_i]$ on the test dataset. In this note, the assumption is that $P_1|_{\mathcal{A}} \ne P_0|_{\mathcal{A}}$. In the machine learning literature, this situation is called dataset shift (see [6] and references therein).
Specifically, we consider the following two dataset shift types (according to [6]):
  • Covariate shift. $P_1|_{\mathcal{H}} \ne P_0|_{\mathcal{H}}$ but $P_1[A_i \mid \mathcal{H}] = P_0[A_i \mid \mathcal{H}]$ for $i = 1, \ldots, k$. In practice, this implies $P_1[A_i] \ne P_0[A_i]$ for most if not all of the $i$. In a credit risk context, two portfolios of rated credit exposures differ by a covariate shift if the rating grade-level probabilities of default are equal but the rating grade distributions are different in the two portfolios.
  • Prior probability shift. $P_1[A_i] \ne P_0[A_i]$ for at least one $i$ but $P_1[H \mid A_i] = P_0[H \mid A_i]$ for $i = 1, \ldots, k$, $H \in \mathcal{H}$. This implies $P_1|_{\mathcal{H}} \ne P_0|_{\mathcal{H}}$ if $P_0[\cdot \mid A_1]$, …, $P_0[\cdot \mid A_k]$ are linearly independent. In a credit risk context, two portfolios of rated credit exposures differ by a prior probability shift if (1) the rating grade distributions of defaulting exposures in the two portfolios are equal, (2) the rating grade distributions of surviving exposures in the two portfolios are equal, but (3) the portfolio-wide default rates of the two portfolios are different.
In practice, it is likely that $P_1[A_i] \ne P_0[A_i]$ for some $i$ under both covariate shift and prior probability shift. Hence, quantification in the sense of estimation of the $P_1[A_i]$ is important for both covariate shift and prior probability shift. Under the covariate shift assumption, a natural estimator of $P_1[A_i]$ is given by
$$\widetilde{P}_1[A_i] = E_1\bigl[P_0[A_i \mid \mathcal{H}]\bigr], \qquad i = 1, \ldots, k \qquad (10)$$
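In the sample setting, (10) amounts to averaging the training-set-based conditional class probabilities (e.g., calibrated classifier scores) over the test examples. A minimal sketch, assuming `scores_test` holds estimates of $P_0[A_i \mid \mathcal{H}]$ for each test example:

```python
import numpy as np

def covariate_shift_estimate(scores_test):
    """Covariate shift estimator (10).

    scores_test: array of shape (N_test, k) holding the training-set
    conditional class probabilities P_0[A_i | H] evaluated at the test
    examples (e.g., calibrated classifier scores).
    Returns the estimates of P_1[A_1], ..., P_1[A_k].
    """
    return np.asarray(scores_test, dtype=float).mean(axis=0)
```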
Under prior probability shift, the choice of suitable estimators of $P_1[A_i]$ is less obvious.
The following result generalises the Scaled Probability Average method of [13] to the multi-class case. It makes it possible to derive prior probability shift estimates of the prior class probabilities from covariate shift estimates as given by (10).
Proposition 5. Under Assumption 4, suppose that there are $q_1 \ge 0$, …, $q_k \ge 0$ with $\sum_{i=1}^k q_i = 1$ such that $P_1$ can be represented as a simple finite mixture as follows:
$$P_1[H] = \sum_{i=1}^k q_i\, P_0[H \mid A_i], \qquad H \in \mathcal{H} \qquad (11a)$$
Then it follows that
$$\begin{pmatrix} E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] \\ \vdots \\ E_1\bigl[P_0[A_k \mid \mathcal{H}]\bigr] \end{pmatrix} = M \begin{pmatrix} q_1 \\ \vdots \\ q_k \end{pmatrix} \qquad (11b)$$
where the matrix $M = (m_{ij})_{i,j = 1, \ldots, k}$ is given by
$$m_{ij} = E_0\bigl[P_0[A_i \mid \mathcal{H}] \bigm| A_j\bigr] = E_0\bigl[P_0[A_i \mid \mathcal{H}]\; P_0[A_j \mid \mathcal{H}]\bigr] \big/ P_0[A_j] \qquad (11c)$$
Proof. Immediate from (11a) and the definition of conditional expectation. ☐
For practical purposes, the first representation of $m_{ij}$ in (11c) is more useful because most of the time no exact estimate of $P_0[A_j \mid \mathcal{H}]$ will be available. As a consequence, there might be a non-zero difference between the values of the two expectations in (11c). In contrast to the second representation, however, the derivation of the first representation in (11c) makes no use of the specific properties of conditional expectations.
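A sketch of the resulting multi-class Scaled Probability Average estimator, with $m_{ij}$ estimated from the labelled training sample via the first representation in (11c); the names are illustrative, and the clipping and renormalisation at the end are pragmatic assumptions for the case where the solved weights fall slightly outside the unit simplex.

```python
import numpy as np

def scaled_probability_average(scores_train, labels_train, scores_test):
    """Multi-class Scaled Probability Average estimate of the test priors q.

    scores_train: (N0, k) estimates of P_0[A_i | H] on the training set.
    labels_train: (N0,) integer class labels in {0, ..., k-1}.
    scores_test:  (N1, k) the same score function evaluated on the test set.
    Solves the linear system (11b) with m_ij estimated via the first
    representation in (11c); assumes M is invertible.
    """
    s0 = np.asarray(scores_train, dtype=float)
    s1 = np.asarray(scores_test, dtype=float)
    y0 = np.asarray(labels_train)
    k = s0.shape[1]
    # m_ij = average of the i-th score over the training examples of class j
    M = np.column_stack([s0[y0 == j].mean(axis=0) for j in range(k)])
    v = s1.mean(axis=0)              # covariate shift estimates (10)
    q = np.linalg.solve(M, v)        # invert (11b)
    q = np.clip(q, 0.0, None)        # guard against small negative components
    return q / q.sum()
```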
Corollary 6. In the setting of Proposition 5, suppose that $k = 2$. Define
$$R_0^2 = \frac{\operatorname{var}_0\bigl[P_0[A_1 \mid \mathcal{H}]\bigr]}{P_0[A_1]\,(1 - P_0[A_1])} \in [0, 1]$$
with $\operatorname{var}_0$ denoting the variance under $P_0$. Then we have
$$E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] = P_0[A_1]\,(1 - R_0^2) + q_1\, R_0^2 \qquad (12)$$
See appendix for a proof of Corollary 6. By Corollary 6, it holds that
$$P_0[A_1] \le E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] \le q_1 \quad \text{if } P_0[A_1] \le q_1, \qquad P_0[A_1] \ge E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] \ge q_1 \quad \text{if } P_0[A_1] \ge q_1 \qquad (13)$$
Hence, for binary classification problems the covariate shift estimator (10) underestimates the change in the class probabilities if the dataset shift is not a covariate shift but a prior probability shift. See Section 2.1 of [2] for a similar result for the Classify & Count estimator.
However, according to (12), the difference between the covariate shift estimator and the true prior probability becomes smaller as the discriminatory power of the classifier (as measured by the generalised $R^2$) increases. Moreover, both (12) and (11b) provide closed-form solutions for $q_1$, …, $q_k$ that transform the covariate shift estimates into correct estimates under the prior probability shift assumption. In the following, the estimators defined in this way are called Scaled Probability Average estimators.
Corollary 6 on the relationship between covariate shift and Scaled Probability Average estimates in the binary classification case can be generalised to the relationship between covariate shift and KL distance estimates.
Corollary 7. Under Assumption 4, consider the case $k = 2$. Let $R_1 = \frac{f_1}{f_2}$ and suppose that (8b) holds for $E = E_1$ such that a solution $p_1 \in (0,1)$ of (7) exists. Then there is some $\alpha \in [0,1]$ such that
$$E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] = (1 - \alpha)\, P_0[A_1] + \alpha\, p_1 \qquad (14)$$
See appendix for a proof of Corollary 7. From the corollary, we obtain an inequality for the covariate shift and KL distance estimates, which is similar to (13):
$$P_0[A_1] \le E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] \le p_1 \quad \text{if } P_0[A_1] \le p_1, \qquad P_0[A_1] \ge E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] \ge p_1 \quad \text{if } P_0[A_1] \ge p_1$$
How is the KL distance estimator (or the ML estimator in the case of P being the empirical measure) of the prior class probabilities, defined by the solution of (7), in general, related to the covariate shift and Scaled Probability Average estimators?
Suppose the test dataset differs from the training dataset by a prior probability shift with positive class probabilities, i.e., (11a) applies with $q_1, \ldots, q_k > 0$. Under Assumption 4 and a mild linear independence condition on the ratios of the densities $f_i$, Theorem 3 then implies that the KL distance and Scaled Probability Average estimators give the same results. Observe that in the context given by Assumption 4 the variables $R_i$ from Theorem 3 can be directly defined as $R_i = \frac{f_i}{f_k}$, $i = 1, \ldots, k-1$, or, equivalently, by
$$R_i = \frac{P_0[A_i \mid \mathcal{H}]}{P_0[A_k \mid \mathcal{H}]}\, \frac{P_0[A_k]}{P_0[A_i]}, \qquad i = 1, \ldots, k-1 \qquad (15)$$
Representation (15) of the density ratios might be preferable in particular if the classifier involved has been built by binary or multinomial logistic regression.
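A sketch of how (15) can be evaluated when a binary or multinomial logistic regression fitted on the training set provides the posterior probabilities; `posteriors` and `train_priors` are illustrative names.

```python
import numpy as np

def density_ratios_from_posteriors(posteriors, train_priors):
    """Density ratios R_i = f_i / f_k via representation (15).

    posteriors:   (N, k) fitted values of P_0[A_i | H], e.g., from a binary
                  or multinomial logistic regression trained on P_0.
    train_priors: (k,) training class probabilities P_0[A_i].
    Returns an (N, k-1) array with R_1, ..., R_{k-1} at the N feature points.
    """
    post = np.asarray(posteriors, dtype=float)
    prior = np.asarray(train_priors, dtype=float)
    return (post[:, :-1] / post[:, [-1]]) * (prior[-1] / prior[:-1])
```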
In general, by Theorem 3, the result of applying the KL distance estimator to the test feature distribution P 1 , in the quantification problem context described by Assumption 4, is a representation of P 1 as a mixture of distributions whose density ratios are the same as the density ratios of the class feature distributions P 0 [ · | A i ] , i = 1 , , k .
Hence, the KL distance estimator makes sense under an assumption of identical density ratios in the training and test datasets. On the one hand, this assumption is similar to the assumption of identical conditional class probabilities in the covariate shift assumption but does not depend in any way on the training set prior class probabilities. This is in contrast to the covariate shift assumption where implicitly a “memory effect” with regard to the training set prior class probabilities is accepted.
On the other hand, the “identical density ratios” assumption is weaker than the “identical densities” assumption (the former is implied by the latter), which is part of the prior probability shift assumption.
One possible description of “identical density ratios” and the related KL distance estimator is that “identical density ratios” generalises “identical densities” in such a way that exact fit of the test set feature distribution is achieved (which by Theorem 3 is not always possible). It is therefore fair to say that “identical density ratios” is closer to “identical densities” than to “identical conditional class probabilities”.
Given training data with full information (indicated by the σ-field $\mathcal{A}$ in Assumption 4) and test data with information only on the features but not on the classes (σ-field $\mathcal{H}$ in Assumption 4), it is not possible to decide whether the covariate shift or the identical density ratios assumption is more appropriate for the data, because both assumptions result in an exact fit of the test set feature distribution $P_1|_{\mathcal{H}}$ but in general give quite different estimates of the test set prior class probabilities (see Corollary 7 and Section 5). Only if Equation (7) has no solution with positive components can it be said that “identical density ratios” does not properly describe the test data, because then there is no exact fit of the test set feature distribution. In that case “covariate shift” might not be appropriate either, but at least it delivers a mathematically consistent model of the data.
If both “covariate shift” and “identical density ratios” provide consistent models (i.e., exact fit of the test set feature distribution) non-mathematical considerations of causality (are features caused by class or is class caused by features?) may help in choosing the more suitable assumption. See [14] for a detailed discussion of this issue.

4. Cost Quantification

“Cost quantification” is explained in [2] as follows:
“The second form of the quantification task is for a common situation in business where a cost or value attribute is associated with each case. For example, a customer support log has a database field to record the amount of time spent to resolve each individual issue, or the total monetary cost of parts and labor used to fix the customer’s problem. … The cost quantification task for machine learning: given a limited training set with class labels, induce a cost quantifier that takes an unlabeled test set as input and returns its best estimate of the total cost associated with each class. In other words, return the subtotal of cost values for each class.”
Careful reading of Section 4.2 of [2] reveals that the favourite solutions for cost quantification presented by the author essentially apply only to the case where the cost attributes are constant on the classes. Only then do the $C^+$ as used in Equations (4) and (5) of [2] stand for the same conditional expectations—the same observation applies to $C^-$.
Cost quantification can be treated more generally under Assumption 4 of this note. Denote by $C$ the (random) cost associated with an example. According to the description of cost quantification quoted above, $C$ is actually a feature of the example and, therefore, may be considered an $\mathcal{H}$-measurable random variable under Assumption 4.
In mathematical terms, the objective of cost quantification is the estimation of the total expected cost per class, $E_1[C\, \mathbf{1}_{A_i}]$, $i = 1, \ldots, k$. For a set $M$, its indicator function $\mathbf{1}_M$ is defined as $\mathbf{1}_M(m) = 1$ for $m \in M$ and $\mathbf{1}_M(m) = 0$ for $m \notin M$.
Covariate shift assumption. Under this assumption we obtain
$$E_1[C\, \mathbf{1}_{A_i}] = E_1\bigl[C\; P_1[A_i \mid \mathcal{H}]\bigr] = E_1\bigl[C\; P_0[A_i \mid \mathcal{H}]\bigr] \qquad (16)$$
This gives a probability-weighted version of the “Classify & Total” estimator of [2].
Constant density ratios assumption. Let $R_i = \frac{f_i}{f_k}$, $i = 1, \ldots, k-1$. If (7) (with $\mu = P_1$ and $g = 1$) has a solution $p_1 > 0$, …, $p_{k-1} > 0$, $p_k = 1 - \sum_{j=1}^{k-1} p_j < 1$, then we can estimate the conditional class probabilities $P_1[A_i \mid \mathcal{H}]$ by
$$P_1[A_i \mid \mathcal{H}] = \frac{p_i\, R_i}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}, \quad i = 1, \ldots, k-1, \qquad P_1[A_k \mid \mathcal{H}] = \frac{p_k}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}$$
From this, it follows that
$$E_1[C\, \mathbf{1}_{A_i}] = p_i\, E_1\!\left[\frac{C\, R_i}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}\right], \quad i = 1, \ldots, k-1, \qquad E_1[C\, \mathbf{1}_{A_k}] = p_k\, E_1\!\left[\frac{C}{1 + \sum_{j=1}^{k-1} p_j\,(R_j - 1)}\right] \qquad (17)$$
Obviously, the accuracy of the estimates on the r.h.s. of both (16) and (17) strongly depends on the accuracy of the estimates of $P_0[A_i \mid \mathcal{H}]$ and of the density ratios on the training set. Accurate estimates of these quantities will, in general, make full use of the information in the σ-field $\mathcal{H}$ (i.e., the information available at the time of estimation) and, because of the $\mathcal{H}$-measurability of $C$, of the cost feature $C$. In order to achieve this, $C$ must be used as an explanatory variable when the relationship between the classes $A_i$ and the features reflected in $\mathcal{H}$ is estimated (e.g., by a regression approach). As one-dimensional densities are relatively easy to estimate, it might make sense to deploy (16) and (17) with the choice $\mathcal{H} = \sigma(C)$.
Note that this conclusion, at first glance, seems to contradict Section 5.3.1 of [2], where it is recommended that “the cost attribute almost never be given as a predictive input feature to the classifier”. Actually, with regard to the cost quantifiers suggested in [2], this recommendation is reasonable because the main component of the quantifiers as stated in (6) of [2] is correctly specified only if there is no dependence between the cost attribute $C$ and the classifier. Not using $C$ as an explanatory variable, however, does not necessarily imply that the dependence between $C$ and the classifier is weak. Indeed, if the classifier has any predictive power and $C$ is on average different on the different classes of examples, then there must be a non-zero correlation between the cost attribute $C$ and the output of the classifier.
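To illustrate the two cost quantifiers, the following sketch evaluates (16) and (17) on a test sample. `costs`, `scores_test`, `ratios_test` and `weights` are illustrative names for the cost feature $C$, the estimated $P_0[A_i \mid \mathcal{H}]$ at the test examples, the density ratios $R_i$ at the test examples, and the weights $(p_1, \ldots, p_k)$ obtained from (7). The returned figures are per-example expectations, so the class subtotals are obtained by multiplying with the number of test cases.

```python
import numpy as np

def cost_quantifier_covariate_shift(costs, scores_test):
    """Cost quantifier (16): expected cost per class under covariate shift.

    costs:       (N,) cost attribute C of the test examples.
    scores_test: (N, k) training-set conditional class probabilities
                 P_0[A_i | H] evaluated at the test examples.
    Returns a (k,) array estimating E_1[C 1_{A_i}].
    """
    c = np.asarray(costs, dtype=float)
    s = np.asarray(scores_test, dtype=float)
    return (c[:, None] * s).mean(axis=0)

def cost_quantifier_density_ratios(costs, ratios_test, weights):
    """Cost quantifier (17) under the constant density ratios assumption.

    ratios_test: (N, k-1) density ratios R_i = f_i / f_k at the test examples.
    weights:     (k,) solution (p_1, ..., p_k) of (7) with mu = P_1 and g = 1.
    Returns a (k,) array estimating E_1[C 1_{A_i}].
    """
    c = np.asarray(costs, dtype=float)
    r = np.asarray(ratios_test, dtype=float)
    p = np.asarray(weights, dtype=float)
    denom = 1.0 + r @ p[:-1] - p[:-1].sum()   # 1 + sum_j p_j (R_j - 1)
    out = np.empty(len(p))
    out[:-1] = p[:-1] * (c[:, None] * r / denom[:, None]).mean(axis=0)
    out[-1] = p[-1] * (c / denom).mean()
    return out
```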

5. Loss Rates Estimation with Mixture Model Methods

Theorem 3 and the results of Section 3 have obvious applications to the problem of forecasting portfolio-wide default rates in portfolios of rated or scored borrowers. The forecast portfolio-wide default rate may be interpreted in an individual sense as a single borrower’s unconditional probability of default. However, there is also an interpretation in a collective sense as the forecast total proportion of defaulting borrowers.
The statements of Theorem 3 and Assumption 4 are agnostic in the sense of not suggesting an individual or collective interpretation of the models under inspection. In contrast, by explaining Assumption 4 in terms of a classifier and the examples to which it is applied, we have suggested an individual interpretation of the assumption.
However, there is no need to adopt this perspective on Assumption 4 and the results of Section 3. Instead of interpreting $P_0[A_1]$ as an individual example's probability of belonging to class 1, we could as well describe $P_0[A_1]$ as the proportion of a mass or substance that has property 1. If we do so, we switch from an interpretation of probability spaces in terms of likelihoods associated with individuals to an interpretation in terms of proportions of parts of masses or substances.
Let us look at a retail mortgage portfolio as an illustrative example. Suppose that each mortgage has an associated loan-to-value (LTV) ratio that indicates how well the mortgage loan is secured by the pledged property. Mortgage providers typically report their exposures and losses in tables that provide this information per LTV band without specifying the numbers or percentages of borrowers involved. Table 1 shows a stylised example of what such a report might look like.
Table 1. Fictitious report on mortgage portfolio exposure distribution and loss rates.

| LTV Band | Last Year: % of exposure | Last Year: of this % lost | This Year: % of exposure | This Year: of this % lost |
|---|---|---|---|---|
| More than 100% | 10.3 | 15.0 | 13.3 | ? |
| Between 90% and 100% | 28.2 | 2.2 | 24.2 | ? |
| Between 70% and 90% | 12.9 | 1.1 | 12.8 | ? |
| Between 50% and 70% | 24.9 | 0.5 | 25.4 | ? |
| Less than 50% | 23.8 | 0.2 | 24.3 | ? |
| All | 100.0 | 2.2 | 100.0 | ? |
This portfolio description fits well into the framework described by Assumption 4. Choose the events $H_1$ = “More than 100% LTV”, $H_2$ = “Between 90% and 100% LTV”, etc. Then the σ-field $\mathcal{H}$ is generated by the finite partition $H_1, \ldots, H_5$. Similarly, choose $A_1$ = “lost” and $A_2$ = “not lost”. The measure $P_0$ describes last year's observations; $P_1$ specifies the distribution of the exposure over the LTV bands as observed at the beginning of this year—which is the forecast period. We can then try to replace the question marks in Table 1 by deploying the estimators discussed in Section 3. Table 2 shows the results.
Table 2. This year's loss rate forecast for the portfolio from Table 1. The $R_0^2$ used for calculating the Scaled Probability Average is 7.7%.

| LTV Band | Covariate Shift (% lost) | Scaled Probability Average (% lost) | KL Distance (% lost) |
|---|---|---|---|
| More than 100% | 15.0 | 34.6 | 35.4 |
| Between 90% and 100% | 2.2 | 6.3 | 6.5 |
| Between 70% and 90% | 1.1 | 3.2 | 3.3 |
| Between 50% and 70% | 0.5 | 1.5 | 1.5 |
| Less than 50% | 0.2 | 0.6 | 0.6 |
| All | 2.8 | 7.3 | 7.1 |
Clearly, the estimates under the prior probability shift assumptions are much more sensitive to changes in the distribution of the features (i.e., LTV bands) than the estimates under the covariate shift assumption. Thus the theoretical results of Corollaries 6 and 7 are confirmed. However, recall that there is no right or wrong here as all the numbers in Table 1 are purely fictitious. Nonetheless, we could conclude that in applications with unclear causalities (like for credit risk measurement) it might make sense to compute both covariate shift estimates and KL estimates (more suitable under a prior probability shift assumption) in order to gauge the possible range of outcomes.
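For concreteness, the following sketch recomputes the three estimators for the Table 1 inputs in this discrete, binary setting (classes “lost”/“not lost”, features given by the five LTV bands). It is an illustration only: it works with the rounded percentages printed in Table 1, so the results come close to, but need not exactly match, the figures reported in Table 2.

```python
import numpy as np

# Table 1 inputs: exposure shares (%) and loss rates (%) per LTV band
w0 = np.array([10.3, 28.2, 12.9, 24.9, 23.8]) / 100    # last year's exposure shares
l0 = np.array([15.0, 2.2, 1.1, 0.5, 0.2]) / 100        # last year's loss rates per band
w1 = np.array([13.3, 24.2, 12.8, 25.4, 24.3]) / 100    # this year's exposure shares

p0 = w0 @ l0                       # last year's portfolio-wide loss rate P_0[A_1]
f1 = w0 * l0 / p0                  # P_0[band | lost]
f2 = w0 * (1 - l0) / (1 - p0)      # P_0[band | not lost]
R = f1 / f2                        # density ratios R_1 per band

# Covariate shift estimator (10): keep the band-level loss rates, reweight the bands
cov_total = w1 @ l0

# Scaled Probability Average, binary case: invert (12)
r0_sq = (w0 @ (l0 - p0) ** 2) / (p0 * (1 - p0))
q1 = (cov_total - p0 * (1 - r0_sq)) / r0_sq

# KL distance (ML) estimator: solve the sample version of (7) by bisection
def h(p):
    return np.sum(w1 * (R - 1) / (1 + p * (R - 1)))

lo, hi = 0.0, 1.0
while hi - lo > 1e-12:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
p1 = 0.5 * (lo + hi)

# Band-level forecasts: Bayes formula with the shifted prior
spa_bands = q1 * R / (1 + q1 * (R - 1))
kl_bands = p1 * R / (1 + p1 * (R - 1))
print("portfolio-wide: covariate shift %.3f, SPA %.3f, KL %.3f" % (cov_total, q1, p1))
print("per band (SPA, %):", np.round(spa_bands * 100, 1))
print("per band (KL, %):", np.round(kl_bands * 100, 1))
```

With these rounded inputs, the portfolio-wide forecasts come out at roughly 2.8% for the covariate shift estimator and around 7% for the two prior-probability-shift-type estimators, in line with the ordering shown in Table 2, though not necessarily with every reported digit.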

6. Conclusions

We have revisited the maximum likelihood estimator (or more generally Kullback–Leibler distance estimator) of the component weights in simple finite mixture models. We have found that (if all weights of the estimate are positive) it enjoys an exact fit property, which makes it even more attractive with regard to mathematical consistency. We have suggested a Gauss–Seidel-type approach to the calculation of the Kullback–Leibler distance estimator that triggers an alarm if there is no solution with all components positive (which would indicate that the number of modelled classes may be reduced).
In the context of two-class quantification problems, as a consequence of the exact fit property we have shown theoretically and by example that the straightforward “covariate shift” estimator of the prior class probabilities may seriously underestimate the change of the prior probabilities if the covariate shift assumption is wrong and instead a prior probability shift has occurred. This underestimation can be corrected by the Scaled Probability Average approach, which we have generalised to the multi-class case, or the Kullback–Leibler distance estimator.
As an application example, we then have discussed cost quantification, i.e., the attribution of total cost to classes on the basis of characterising features when class membership is unknown. In addition, we have illustrated by example that the mixture model approach to quantification is not restricted to the forecast of prior probabilities but can also be deployed for forecasting loss rates.
Both the theoretical (Corollary 7) and illustrative numerical results (Table 2) suggest that the difference between the Kullback–Leibler and the more familiar covariate shift estimates of prior class probabilities (default probabilities in the case of credit risk) can be material. This may actually be interpreted as an indication of model risk because the assumptions underlying the two estimators are quite different (“identical density ratios” vs. “identical conditional class probabilities”). It seems therefore reasonable to calculate both estimates and analyse the potential reasons for the difference if it turns out to be material.

Appendix: Proofs

Proof of Theorem 3. For (a) we have to show that $\int g_i\, d\mu = 1$, $i = 1, \ldots, k$. The other claims are obvious. By (5), Equation (7) implies that $\int g_i\, d\mu = \int g_k\, d\mu$, $i = 1, \ldots, k-1$. This in turn implies
$$1 = \int g\, d\mu = \sum_{i=1}^k p_i \int g_i\, d\mu = \int g_k\, d\mu\, \sum_{i=1}^k p_i = \int g_k\, d\mu \qquad (18)$$
Hence $\int g_i\, d\mu = 1$ for all $i$.
With regard to (b), observe that
$$\frac{\partial G_i}{\partial q_j}(q_1, \ldots, q_{k-1}) = -\int \frac{g\,(R_i - 1)(R_j - 1)}{\bigl(1 + \sum_{\ell=1}^{k-1} q_\ell\,(R_\ell - 1)\bigr)^2}\, d\mu$$
As in (6), for $(q_1, \ldots, q_{k-1}) \in S_{k-1}$ it can easily be shown that $\frac{\partial G_i}{\partial q_j}$ is well-defined and finite. Denote by $J = \Bigl(\frac{\partial G_i}{\partial q_j}\Bigr)_{i,j = 1, \ldots, k-1}$ the Jacobi matrix of the $G_i$ and let $R = (R_1 - 1, \ldots, R_{k-1} - 1)$. Let $a \in \mathbb{R}^{k-1}$ and let $a^T$ be the transpose of $a$. Then it holds that
$$a\, J\, a^T = -\int \frac{g\,(R\, a^T)^2}{\bigl(1 + \sum_{\ell=1}^{k-1} q_\ell\,(R_\ell - 1)\bigr)^2}\, d\mu \le 0$$
In addition, by the assumption of linear independence of the $R_i - 1$, $0 = a\, J\, a^T$ implies $a = 0 \in \mathbb{R}^{k-1}$. Hence $J$ is negative definite. From this it follows by the mean value theorem that the solution $(p_1, \ldots, p_{k-1})$ of (7) is unique in $S_{k-1}$.
By the assumptions on $(q_1, \ldots, q_k)$ and $h_1, \ldots, h_k$ we have for $i = 1, \ldots, k-1$
$$1 = \int h_i\, d\mu = \int \frac{g\, h_i}{\sum_{j=1}^k q_j h_j}\, d\mu = \int \frac{g\, R_i}{1 + \sum_{j=1}^{k-1} q_j\,(R_j - 1)}\, d\mu, \quad \text{and} \quad 1 = \int h_k\, d\mu = \int \frac{g\, h_k}{\sum_{j=1}^k q_j h_j}\, d\mu = \int \frac{g}{1 + \sum_{j=1}^{k-1} q_j\,(R_j - 1)}\, d\mu$$
Hence we obtain $0 = \int \frac{g\,(R_i - 1)}{1 + \sum_{j=1}^{k-1} q_j\,(R_j - 1)}\, d\mu$, $i = 1, \ldots, k-1$. By uniqueness of the solution of (7) it follows that $q_i = p_i$, $i = 1, \ldots, k$. With this, it can be shown similarly to (18) that
$$\int_H h_i\, d\mu = \int_H g_i\, d\mu, \qquad H \in \mathcal{H}, \quad i = 1, \ldots, k$$
This implies $h_i = g_i$, $i = 1, \ldots, k$. ☐
Proof of Corollary 6. We start from the first element of the vector Equation (11b) and apply some algebraic manipulations:
$$E_1\bigl[P_0[A_1 \mid \mathcal{H}]\bigr] = \frac{q_1}{P_0[A_1]}\, E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr] + \frac{1 - q_1}{1 - P_0[A_1]}\, E_0\bigl[P_0[A_1 \mid \mathcal{H}]\,(1 - P_0[A_1 \mid \mathcal{H}])\bigr]$$
$$= \bigl(P_0[A_1]\,(1 - P_0[A_1])\bigr)^{-1} \Bigl(q_1\,(1 - P_0[A_1])\, E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr] + P_0[A_1]\,(1 - q_1)\bigl(P_0[A_1] - E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr]\bigr)\Bigr)$$
$$= \bigl(P_0[A_1]\,(1 - P_0[A_1])\bigr)^{-1} \Bigl(q_1\bigl(E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr] - P_0[A_1]^2\bigr) + P_0[A_1]\bigl(P_0[A_1] - E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr]\bigr)\Bigr)$$
From $P_0[A_1] - E_0\bigl[P_0[A_1 \mid \mathcal{H}]^2\bigr] = P_0[A_1]\,(1 - P_0[A_1]) - \operatorname{var}_0\bigl[P_0[A_1 \mid \mathcal{H}]\bigr]$ the result now follows. ☐
Proof of Corollary 7. Suppose that $g > 0$ is a density of $P_1$ with respect to some measure $\nu$ on $(\Omega, \mathcal{H})$. The measure $\nu$ does not necessarily equal the measure $\mu$ mentioned in Assumption 4. We can choose $\nu = P_1$ and $g = 1$ if there is no other candidate. By Theorem 3 (a) there are then $\nu$-densities $g_1 \ge 0$, $g_2 > 0$ such that $\frac{g_1}{g_2} = R_1$ and $g = p_1 g_1 + (1 - p_1)\, g_2$.
We define a new probability measure $\widetilde{P}_0$ on $(\Omega, \mathcal{A})$ by setting
$$\widetilde{P}_0[A_i] = P_0[A_i], \quad i = 1, 2, \qquad \widetilde{P}_0[H \mid A_i] = \int_H g_i\, d\nu, \quad H \in \mathcal{H}, \; i = 1, 2$$
By construction of $\widetilde{P}_0$ it holds that
$$P_1[H] = p_1\, \widetilde{P}_0[H \mid A_1] + (1 - p_1)\, \widetilde{P}_0[H \mid A_2], \qquad H \in \mathcal{H}$$
Hence we may apply Corollary 6 to obtain
$$E_1\bigl[\widetilde{P}_0[A_1 \mid \mathcal{H}]\bigr] = P_0[A_1]\,(1 - \widetilde{R}_0^2) + p_1\, \widetilde{R}_0^2$$
where $\widetilde{R}_0^2 \in [0,1]$ is defined like $R_0^2$ but with $P_0$ replaced by $\widetilde{P}_0$. Observe that, also by construction of $\widetilde{P}_0$, we have
$$\widetilde{P}_0[A_1 \mid \mathcal{H}] = \frac{P_0[A_1]\, g_1}{P_0[A_1]\, g_1 + P_0[A_2]\, g_2} = \frac{P_0[A_1]\, R_1}{P_0[A_1]\, R_1 + P_0[A_2]} = P_0[A_1 \mid \mathcal{H}]$$
With the choice $\alpha = \widetilde{R}_0^2$ this proves Corollary 7. ☐

Conflicts of Interest

The author declares no conflict of interest.

References

  1. D. Tasche. “The art of probability-of-default curve calibration.” J. Credit Risk 9 (2013): 63–103. [Google Scholar]
  2. G. Forman. “Quantifying counts and costs via classification.” Data Min. Knowl. Discov. 17 (2008): 164–206. [Google Scholar] [CrossRef]
  3. R. Redner, and H. Walker. “Mixture densities, maximum likelihood and the EM algorithm.” SIAM Rev. 26 (1984): 195–239. [Google Scholar] [CrossRef]
  4. S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. New York, NY, USA: Springer, 2006. [Google Scholar]
  5. P. Schlattmann. Medical Applications of Finite Mixture Models. Berlin, Germany: Springer, 2009. [Google Scholar]
  6. J. Moreno-Torres, T. Raeder, R. Alaiz-Rodriguez, N. Chawla, and F. Herrera. “A unifying view on dataset shift in classification.” Pattern Recognit. 45 (2012): 521–530. [Google Scholar] [CrossRef]
  7. V. Hofer, and G. Krempl. “Drift mining in data: A framework for addressing drift in classification.” Comput. Stat. Data Anal. 57 (2013): 377–391. [Google Scholar] [CrossRef]
  8. D. Hopkins, and G. King. “A Method of Automated Nonparametric Content Analysis for Social Science.” Am. J. Polit. Sci. 54 (2010): 229–247. [Google Scholar] [CrossRef]
  9. M. Du Plessis, and M. Sugiyama. “Semi-supervised learning of class balance under class-prior change by distribution matching.” Neural Netw. 50 (2014): 110–119. [Google Scholar] [CrossRef] [PubMed]
  10. M. Saerens, P. Latinne, and C. Decaestecker. “Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure.” Neural Comput. 14 (2002): 21–41. [Google Scholar] [CrossRef] [PubMed]
  11. C. Peters, and W. Coberly. “The numerical evaluation of the maximum-likelihood estimate of mixture proportions.” Commun. Stat. Theory Methods 5 (1976): 1127–1135. [Google Scholar] [CrossRef]
  12. D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. New York, NY, USA: Wiley, 1985. [Google Scholar]
  13. A. Bella, C. Ferri, J. Hernandez-Orallo, and M. Ramírez-Quintana. “Quantification via probability estimators.” In Proceedings of the 2010 IEEE 10th International Conference on Data Mining (ICDM), Sydney, NSW, Australia, 13–17 December 2010; Los Alamitos, California: IEEE, 2010, pp. 737–742. [Google Scholar]
  14. T. Fawcett, and P. Flach. “A response to Webb and Ting’s On the Application of ROC Analysis to Predict classification Performance under Varying Class Distributions.” Mach. Learn. 58 (2005): 33–38. [Google Scholar] [CrossRef]
