Exact fit of simple finite mixture models

How to forecast next year's portfolio-wide credit default rate based on last year's default observations and the current score distribution? A classical approach to this problem consists of fitting a mixture of the conditional score distributions observed last year to the current score distribution. This is a special (simple) case of a finite mixture model where the mixture components are fixed and only the weights of the components are estimated. The optimum weights provide a forecast of next year's portfolio-wide default rate. We point out that the maximum-likelihood (ML) approach to fitting the mixture distribution not only gives an optimum but even an exact fit if we allow the mixture components to vary but keep their density ratios fixed. From this observation we can conclude that the standard default rate forecast based on last year's conditional default rates will always be located between last year's portfolio-wide default rate and the ML forecast for next year. As an application example, cost quantification is then discussed. We also discuss how the mixture model based estimation methods can be used to forecast total loss. This involves the reinterpretation of an individual classification problem as a collective quantification problem.


Introduction
The study of finite mixture models was initiated in the 1890s by Karl Pearson when he wanted to model multimodal densities. Research on finite mixture models has continued ever since but its focus has changed over time as further areas of application were identified and available computational power increased. More recently the natural connection between finite mixture models and classification methods with their applications in fields like machine learning or credit scoring began to be investigated in more detail. In these applications, it can often be assumed that the mixture models are simple in the sense that the component densities are known (ie there is no dependence on unknown parameters) but their weights are unknown.
Author note: Dirk Tasche, Prudential Regulation Authority, e-mail: dirk.tasche@gmx.net. The author currently works at the Prudential Regulation Authority (a division of the Bank of England). He is also a visiting professor at Imperial College, London. The opinions expressed in this note are those of the author and do not necessarily reflect views of the Bank of England.
In this note, we explore a specific property of simple finite mixture models, namely that their maximum likelihood (ML) estimates provide an exact fit if they exist, and some consequences of this property. In doing so, we extend the discussion of the case 'no independent estimate of unconditional default probability given' from [13] to the multi-class case and general probability spaces.
In Section 2, we present the result on the exact fit property in a general simple finite mixture model context. In Section 3, we discuss the consequences of this result for classification and quantification problems and compare the ML estimator with other estimators that were proposed in the literature. In Section 4, we revisit the cost quantification problem as introduced in [4] as an application. In Section 5, we illustrate by a stylised example from mortgage risk management how the estimators discussed before can be deployed for the forecast of expected loss rates. Section 6 concludes the note.

The exact fit property
We discuss the properties of the ML estimator of the weights in a simple finite mixture model in a general setting which may formally be described as follows:
Assumption 1. $\mu$ is a measure on $(\Omega, H)$. $g > 0$ is a probability density with respect to $\mu$. Write $P$ for the probability measure with $\frac{dP}{d\mu} = g$. Write $E$ for the expectation with regard to $P$.
In applications, the measure $\mu$ will often be a multi-dimensional Lebesgue measure or a counting measure. We study this problem:
Problem. Approximate $g$ by a mixture of probability $\mu$-densities $f_1, \ldots, f_k$, ie $g \approx \sum_{i=1}^k p_i f_i$ for suitable weights $p_1, \ldots, p_k \ge 0$ with $\sum_{i=1}^k p_i = 1$.
In the literature, most of the time a sample version of the problem (ie with $\mu$ as an empirical measure) is discussed. Often the component densities $f_i$ depend on parameters that have to be estimated in addition to the weights $p_i$ (see [10] or [5], and [12] for more recent surveys). In this note, we consider the simple case where the component densities $f_i$ are assumed to be known and fixed. This is a standard assumption for classification (see [8]) and quantification (see [4]) problems.
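Common approaches to the approximation problem are:
• Least squares. Determine the weights that minimise the squared distance $\int \bigl(g - \sum_{i=1}^k p_i f_i\bigr)^2 d\mu$ or a weighted version of it (see [6] and [7] for recent applications to credit risk and text categorisation respectively). The main advantage of the least squares approach compared with other approaches comes from the fact that closed-form solutions are available.
• Kullback-Leibler (KL) distance. Determine the weights that minimise $\int g \log\frac{g}{\sum_{i=1}^k p_i f_i}\, d\mu$ (see [2] for a recent discussion).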
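In the following we examine a special property of the KL approximation which we call the exact fit property. First we note two alternative representations of the KL distance (assuming all integrals are well-defined and finite):
$$\int g \log\frac{g}{\sum_{i=1}^k p_i f_i}\, d\mu = \int g \log g \, d\mu - \int g \log\Bigl(\sum_{i=1}^k p_i f_i\Bigr) d\mu, \tag{1}$$
$$\int g \log\frac{g}{\sum_{i=1}^k p_i f_i}\, d\mu = \int g \log\frac{g}{f_k}\, d\mu - \int g \log\Bigl(1 + \sum_{i=1}^{k-1} p_i (X_i - 1)\Bigr) d\mu, \quad X_i = \frac{f_i}{f_k}, \ i = 1, \ldots, k-1, \tag{2}$$
where (2) requires $f_k > 0$ and follows from $\sum_{i=1}^k p_i f_i = f_k \bigl(1 + \sum_{i=1}^{k-1} p_i (X_i - 1)\bigr)$.
By (1), minimising the KL distance amounts to maximising $\int g \log\bigl(\sum_{i=1}^k p_i f_i\bigr) d\mu$. This problem was studied in [11] (with g = empirical measure, which gives ML estimators). The authors suggested the Expectation-Maximisation (EM) algorithm involving conditional class probabilities for determining the maximum. This works well in general but sometimes suffers from very slow convergence.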
The ML version of (1) had been studied earlier in [9]. There the authors analysed the same iteration procedure, which they stated, however, in terms of densities instead of conditional probabilities. In [9], the iteration was derived differently from [11], namely by studying the gradient of the likelihood function. We revisit the approach from [9] from a different angle by starting from (2).
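As an illustration of the iteration discussed in [9] and [11], the following minimal Python sketch applies the standard EM update for mixture weights with fixed components, $p_i \leftarrow p_i \int g \, f_i / \sum_j p_j f_j \, d\mu$, on a finite sample space (so that $\mu$ is the counting measure). The function name `em_weights` and the three-point example are hypothetical and are not taken from the cited papers.

```python
def em_weights(g, f, p0, n_iter=500):
    """EM fixed-point iteration for the weights of a mixture with fixed components.

    g  : target probabilities on a finite sample space (all > 0)
    f  : list of k component densities, each a list with the same length as g
    p0 : initial weights (positive, summing to 1)
    """
    p = list(p0)
    n = len(g)
    for _ in range(n_iter):
        # mixture density under the current weights
        mix = [sum(pi * fi[x] for pi, fi in zip(p, f)) for x in range(n)]
        # new weight = average (under g) of the conditional class probability p_i f_i / mix
        p = [pi * sum(g[x] * fi[x] / mix[x] for x in range(n))
             for pi, fi in zip(p, f)]
    return p


# Hypothetical example: g is an exact mixture of f1 and f2 with weight 0.3 on f1.
f1 = [0.6, 0.3, 0.1]
f2 = [0.1, 0.3, 0.6]
g = [0.3 * a + 0.7 * b for a, b in zip(f1, f2)]
p_hat = em_weights(g, [f1, f2], [0.5, 0.5])
```

In this example the iteration starts from equal weights and moves towards the weight 0.3 used to construct `g`. The number of iterations is fixed for simplicity; a practical implementation would rather stop on a convergence criterion.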
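There is, however, the complication that $g \log\bigl(1 + \sum_{i=1}^{k-1} p_i (X_i - 1)\bigr)$ is not necessarily integrable. But this observation does not apply to the gradient with respect to $(p_1, \ldots, p_{k-1})$. We therefore focus on investigating the gradient. With $X_i$ as in (2), let
$$F(p_1, \ldots, p_{k-1}) = \int g \log\Bigl(1 + \sum_{i=1}^{k-1} p_i (X_i - 1)\Bigr) d\mu, \quad (p_1, \ldots, p_{k-1}) \in S_{k-1} = \Bigl\{(s_1, \ldots, s_{k-1}): s_1 > 0, \ldots, s_{k-1} > 0, \ \sum_{i=1}^{k-1} s_i < 1\Bigr\}. \tag{3}$$
From this we obtain for the gradient of $F$
$$G_j(p_1, \ldots, p_{k-1}) = \frac{\partial F}{\partial p_j}(p_1, \ldots, p_{k-1}) = \int \frac{g\,(X_j - 1)}{1 + \sum_{i=1}^{k-1} p_i (X_i - 1)}\, d\mu, \quad j = 1, \ldots, k-1. \tag{4}$$
$G_j(p_1, \ldots, p_{k-1})$ is well-defined and finite for $(p_1, \ldots, p_{k-1}) \in S_{k-1}$:
$$|G_j(p_1, \ldots, p_{k-1})| \le \int \frac{g\,(X_j + 1)}{1 - \sum_{i=1}^{k-1} p_i + \sum_{i=1}^{k-1} p_i X_i}\, d\mu \le \frac{1}{p_j} + \frac{1}{1 - \sum_{i=1}^{k-1} p_i} < \infty. \tag{5}$$
We are now in a position to state the main result of this note.
Theorem 1. Let $g$ and $\mu$ be as in Assumption 1. Assume that $X_1, \ldots, X_{k-1} \ge 0$ as defined for (2) are $H$-measurable functions on $\Omega$. Suppose there is a vector $(p_1, \ldots, p_{k-1}) \in S_{k-1}$ such that
$$0 = G_i(p_1, \ldots, p_{k-1}), \quad i = 1, \ldots, k-1, \tag{6}$$
for $G_i$ as defined in (4). Define $p_k = 1 - \sum_{j=1}^{k-1} p_j$. Then the following two statements hold:
a) Define $g_i = \frac{g\,X_i}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}$, $i = 1, \ldots, k-1$, and $g_k = \frac{g}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}$. Then all $g_i$ are probability densities with respect to $\mu$ such that $g = \sum_{i=1}^k p_i g_i$ and $X_i = \frac{g_i}{g_k}$, $i = 1, \ldots, k-1$.
b) Suppose that $X_1 - 1, \ldots, X_{k-1} - 1$ are linearly independent, ie $P\bigl[\sum_{i=1}^{k-1} a_i (X_i - 1) = 0\bigr] = 1$ implies $0 = a = (a_1, \ldots, a_{k-1}) \in \mathbb{R}^{k-1}$. Assume that $h_1, \ldots, h_{k-1} \ge 0$, $h_k > 0$ are $\mu$-densities and that $(q_1, \ldots, q_{k-1}) \in S_{k-1}$ with $q_k = 1 - \sum_{j=1}^{k-1} q_j$ is such that $g = \sum_{i=1}^k q_i h_i$ and $X_i = \frac{h_i}{h_k}$, $i = 1, \ldots, k-1$. Then it follows that $q_i = p_i$ and $h_i = g_i$ for $g_i$ as defined in a), for $i = 1, \ldots, k$.
Proof. For a) it has to be shown that $\int g_i \, d\mu = 1$ for all $i$; the other claims follow directly from the definition of the $g_i$. By (4), Eq. (6) is equivalent to $\int g_i \, d\mu = \int g_k \, d\mu$, $i = 1, \ldots, k-1$. This in turn implies
$$1 = \int g \, d\mu = \sum_{i=1}^k p_i \int g_i \, d\mu = \int g_k \, d\mu \, \sum_{i=1}^k p_i = \int g_k \, d\mu. \tag{7}$$
Hence $\int g_i \, d\mu = 1$ for all $i$.
With regard to b) observe that
$$\frac{\partial G_i}{\partial q_j}(q_1, \ldots, q_{k-1}) = -\int \frac{g\,(X_i - 1)(X_j - 1)}{\bigl(1 + \sum_{l=1}^{k-1} q_l (X_l - 1)\bigr)^2}\, d\mu.$$
Like in (5), for $(q_1, \ldots, q_{k-1}) \in S_{k-1}$ it can easily be shown that $\frac{\partial G_i}{\partial q_j}$ is well-defined and finite. Denote by $J = \bigl(\frac{\partial G_i}{\partial q_j}\bigr)_{i,j=1,\ldots,k-1}$ the Jacobi matrix of the $G_i$ and write $X = (X_1 - 1, \ldots, X_{k-1} - 1)$. Let $a \in \mathbb{R}^{k-1}$ and $a^T$ be the transpose of $a$. Then it holds that
$$a\,J\,a^T = -\int \frac{g\,(X a^T)^2}{\bigl(1 + \sum_{l=1}^{k-1} q_l (X_l - 1)\bigr)^2}\, d\mu \le 0.$$
In addition, by assumption on the linear independence of the $X_i - 1$, $0 = a\,J\,a^T$ implies $a = 0 \in \mathbb{R}^{k-1}$. Hence $J$ is negative definite. From this it follows by the mean value theorem that the solution $(p_1, \ldots, p_{k-1})$ of (6) is unique in $S_{k-1}$.
By assumption on the $h_i$ and $q_i$, we have $h_k = \frac{g}{1 + \sum_{j=1}^{k-1} q_j (X_j - 1)}$ and $h_i = X_i h_k$, and all $h_i$ integrate to 1. Hence we obtain
$$0 = \int (h_i - h_k)\, d\mu = \int \frac{g\,(X_i - 1)}{1 + \sum_{j=1}^{k-1} q_j (X_j - 1)}\, d\mu, \quad i = 1, \ldots, k-1.$$
By uniqueness of the solution of (6) it follows that $q_i = p_i$, $i = 1, \ldots, k$. With this, it can be shown similarly to (7) that $h_i = g_i$, $i = 1, \ldots, k$.
Remark 1. If the KL distance on the lhs of (1) is well-defined and finite for all $(p_1, \ldots, p_{k-1}) \in S_{k-1}$ then under the assumptions of Theorem 1 b) there is a unique $(p^*_1, \ldots, p^*_{k-1}) \in S_{k-1}$ such that the KL distance of $g$ and $\sum_{i=1}^k p^*_i f_i$ is minimal. In addition, by Theorem 1 a), there are densities $g_i$ such that the KL distance of $g$ and $\sum_{i=1}^k p^*_i g_i$ is zero. This is the exact fit property of simple finite mixture models alluded to in the title of this note.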
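The exact fit property can be checked numerically. The sketch below assumes $k = 2$, a finite sample space with counting measure $\mu$, and the construction $g_2 = g / (1 + p_1 (X_1 - 1))$, $g_1 = X_1 g_2$, which is our reading of the densities in Theorem 1 a). The helper `solve_weight` (a plain bisection, not the method of [9] or [11]) and the numbers are hypothetical. Note that `g` is deliberately not a mixture of `f1` and `f2`.

```python
def solve_weight(g, x, tol=1e-14):
    """Bisection for the root p in (0, 1) of G(p) = sum g*(x-1)/(1+p*(x-1)).

    G is decreasing in p, so a root in (0, 1) exists iff G(0) > 0 > G(1).
    """
    def G(p):
        return sum(gi * (xi - 1.0) / (1.0 + p * (xi - 1.0)) for gi, xi in zip(g, x))

    lo, hi = 0.0, 1.0
    if not (G(lo) > 0.0 > G(hi)):
        raise ValueError("no solution in (0, 1)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical fixed components and a target that is NOT a mixture of them.
f1 = [0.6, 0.3, 0.1]
f2 = [0.1, 0.3, 0.6]
g = [0.2, 0.4, 0.4]

x = [a / b for a, b in zip(f1, f2)]          # density ratio X_1 = f_1 / f_2
p1 = solve_weight(g, x)                      # weight of component 1

# Components with the same density ratio as f1/f2 that reproduce g exactly.
g2 = [gi / (1.0 + p1 * (xi - 1.0)) for gi, xi in zip(g, x)]
g1 = [xi * g2i for xi, g2i in zip(x, g2)]
refit = [p1 * a + (1.0 - p1) * b for a, b in zip(g1, g2)]
```

Here `g1` and `g2` each sum to 1 (up to rounding) and `refit` coincides with `g`, although no mixture of `f1` and `f2` equals `g`.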
Remark 2. a) Theorem 1 b) provides a simple condition for uniqueness of the solution to (6). In the case $k = 2$ this condition simplifies to
$$P[X_1 = 1] < 1. \tag{8a}$$
For $k > 2$ there is no similarly simple condition for the existence of a solution in $S_{k-1}$. However, as noted in [14] (Example 4.3.1), there is a simple necessary and sufficient condition for the existence of a solution in $S_1 = (0, 1)$ to (6) in the case $k = 2$:
$$E[X_1] > 1 \quad \text{and} \quad E\bigl[X_1^{-1}\bigr] > 1. \tag{8b}$$
b) Suppose we are in the setting of Theorem 1 a) with $k > 2$ and all $X_i > 0$. Hence there are $\mu$-densities $g_1, \ldots, g_k$, $(p_1, \ldots, p_{k-1}) \in S_{k-1}$, and $p_k = 1 - \sum_{i=1}^{k-1} p_i$ such that $g = \sum_{i=1}^k p_i g_i$ and $X_i = \frac{g_i}{g_k}$, $i = 1, \ldots, k-1$. Let $\bar{g} = \frac{\sum_{j=2}^k p_j g_j}{1 - p_1}$ and $\bar{X} = \frac{g_1}{\bar{g}}$. Then we have another decomposition of $g$, namely $g = p_1 g_1 + (1 - p_1)\,\bar{g}$. Since $g_1$ and $\bar{g}$ both integrate to 1, $p_1$ solves
$$0 = \int \frac{g\,(\bar{X} - 1)}{1 + p_1 (\bar{X} - 1)}\, d\mu. \tag{9}$$
By a) there is a solution $p_1 \in (0, 1)$ to (9) if and only if $E[\bar{X}] > 1$ and $E\bigl[\bar{X}^{-1}\bigr] > 1$.
c) As mentioned in a), an interesting question for the application of Theorem 1 is how to find out whether or not there is a solution to (6) in $S_{k-1}$ for $k > 2$. The iteration suggested in [9] and [11] will correctly converge to a point on the boundary of $S_{k-1}$ if there is no solution in the interior (Theorem 2 of [9]). But convergence may be so slow that it may remain unclear whether a component of the limit is zero (and therefore the solution is on the boundary) or genuinely very small but positive. The straightforward Newton-Raphson approach for determining the maximum of $F$ defined by (3) may converge faster but may also become unstable for solutions close to or on the boundary of $S_{k-1}$. However, in case $k > 2$ the observation made in b) suggests that the following Gauss-Seidel-type iteration works if the initial value $(p_1^{(0)}, \ldots, p_k^{(0)})$ is sufficiently close to the solution (if any) of (6):
- Assume that for some $n \ge 0$ an approximate solution $(q_1, \ldots, q_k) = (p_1^{(n)}, \ldots, p_k^{(n)})$ has been found.
- For $i = 1, \ldots, k$ try successively to update $(q_1, \ldots, q_k)$ by solving (9) with component $i$ playing the role of component 1 in b) and $p_1 = q_i$ as well as $\bar{g} = \frac{\sum_{j=1, j \ne i}^k q_j g_j}{1 - q_i}$. If for all $i = 1, \ldots, k$ the sufficient and necessary condition for the updated $q_i$ to be in $(0, 1)$ is not satisfied then stop: it is then likely that there is no solution to (6) in $S_{k-1}$. Otherwise update where possible $q_i$ with the solution of (9), resulting in $q_{i,\text{new}}$, and set $q_{j,\text{new}} = \frac{1 - q_{i,\text{new}}}{1 - q_i}\, q_j$ for $j \ne i$, so that the weights again sum to 1.
- After step $k$, take the updated vector as $(p_1^{(n+1)}, \ldots, p_k^{(n+1)})$ unless the iteration has been stopped.
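The Gauss-Seidel-type iteration of Remark 2 c) can be sketched as follows in Python for a finite sample space. This is a minimal sketch under stated assumptions, not the author's implementation: the fixed components $f_j$ are used in place of the $g_j$ (their density ratios coincide), the two-component equation (9) is solved by bisection, the remaining weights are rescaled proportionally after each update, and all names and numbers are hypothetical.

```python
def solve_two_component(g, x, tol=1e-14):
    """Root t in (0, 1) of sum g*(x-1)/(1+t*(x-1)); None if E[x] > 1 and E[1/x] > 1 fails."""
    e_x = sum(gi * xi for gi, xi in zip(g, x))
    e_inv = sum(gi / xi for gi, xi in zip(g, x))
    if not (e_x > 1.0 and e_inv > 1.0):
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        val = sum(gi * (xi - 1.0) / (1.0 + mid * (xi - 1.0)) for gi, xi in zip(g, x))
        if val > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gauss_seidel_weights(g, f, q0, n_sweeps=200):
    """Componentwise update of mixture weights; returns None if a sweep updates nothing."""
    q = list(q0)
    k, m = len(f), len(g)
    for _ in range(n_sweeps):
        updated = False
        for i in range(k):
            # pool the other components with their current relative weights
            rest = [sum(q[j] * f[j][x] for j in range(k) if j != i) / (1.0 - q[i])
                    for x in range(m)]
            ratio = [f[i][x] / rest[x] for x in range(m)]
            t = solve_two_component(g, ratio)
            if t is None:
                continue  # existence condition violated for component i
            scale = (1.0 - t) / (1.0 - q[i])
            q = [t if j == i else q[j] * scale for j in range(k)]
            updated = True
        if not updated:
            return None  # alarm: probably no solution with all weights positive
    return q


# Hypothetical example with three strictly positive components on four points.
f = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
true_w = [0.2, 0.3, 0.5]
g = [sum(w * fj[x] for w, fj in zip(true_w, f)) for x in range(4)]
q_hat = gauss_seidel_weights(g, f, [1.0 / 3, 1.0 / 3, 1.0 / 3])
```

The number of sweeps is fixed for simplicity. Returning `None` corresponds to the stopping rule of the remark, which signals that a solution with all components positive probably does not exist.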

Application to quantification problems
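Finite mixture models occur naturally in machine learning contexts. Specifically, in this note we consider the following context:
Assumption 2.
• $(\Omega, H)$ is a measurable space. For some $k \ge 2$, the events $A_1, \ldots, A_k \subset \Omega$ form a partition of $\Omega$. $A$ is the σ-field generated by $H$ and the $A_i$, ie $A = \sigma\bigl(H \cup \{A_1, \ldots, A_k\}\bigr)$.
• $P_0$ is a probability measure on $(\Omega, A)$ with $P_0[A_i] > 0$ for $i = 1, \ldots, k$. $P_1$ is a probability measure on $(\Omega, H)$. Write $E_i$ for the expectation with respect to $P_i$.
• There is a measure $\mu$ on $(\Omega, H)$ and $\mu$-densities $f_1, \ldots, f_{k-1} \ge 0$, $f_k > 0$ such that $P_0[B \mid A_i] = \int_B f_i \, d\mu$ for $B \in H$, $i = 1, \ldots, k$.
The space $(\Omega, A, P_0)$ describes the training set of a classifier. On the training set, for each example both the features (expressed by $H$) and the class (described by one of the $A_i$) are known. Note that $f_k > 0$ implies $A_k \notin H$. $(\Omega, H, P_1)$ describes the test set on which the classifier is deployed. On the test set only the features of the examples are known.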
In mathematical terms, quantification might be described as the task of extending $P_1$ onto $A$, based on properties observed on the training set, ie of $P_0$. Basically, this means estimating the prior class probabilities (or prevalences) $P_1[A_i]$ on the test dataset. In this note, the assumption is that $P_1|_A \ne P_0|_A$. In the machine learning literature, this situation is called dataset shift (see [8] and the references therein).
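Specifically, we consider the following two dataset shift types (according to [8]):
• Covariate shift. $P_1|_H \ne P_0|_H$ but $P_1[A_i \mid H] = P_0[A_i \mid H]$ for $i = 1, \ldots, k$.
• Prior probability shift. $P_1[A_i] \ne P_0[A_i]$ for at least one $i$ but $P_1[B \mid A_i] = P_0[B \mid A_i]$ for $B \in H$, $i = 1, \ldots, k$.
Under covariate shift, the identity of the conditional class probabilities gives a natural estimator of the test set prior class probabilities:
$$P_1[A_i] = E_1\bigl[P_0[A_i \mid H]\bigr], \quad i = 1, \ldots, k. \tag{10}$$
Under prior probability shift, the choice of suitable estimators of $P_1[A_i]$ is less obvious.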
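The following result generalises the Scaled Probability Average method of [1] to the multi-class case. It makes it possible to derive prior probability shift estimates of prior class probabilities from covariate shift estimates as given by (10).
Proposition 1. Under Assumption 2, suppose that there are $q_1 \ge 0, \ldots, q_k \ge 0$ with $\sum_{i=1}^k q_i = 1$ such that $P_1$ can be represented as a simple finite mixture as follows:
$$P_1[B] = \sum_{i=1}^k q_i \, P_0[B \mid A_i], \quad B \in H. \tag{11a}$$
Then it follows that
$$\bigl(E_1\bigl[P_0[A_1 \mid H]\bigr], \ldots, E_1\bigl[P_0[A_k \mid H]\bigr]\bigr)^T = M \, (q_1, \ldots, q_k)^T, \tag{11b}$$
where the matrix $M = (m_{ij})_{i,j=1,\ldots,k}$ is given by
$$\begin{aligned} m_{ij} = E_0\bigl[P_0[A_i \mid H] \,\big|\, A_j\bigr] &= \frac{E_0\bigl[P_0[A_i \mid H]\, \mathbf{1}_{A_j}\bigr]}{P_0[A_j]} \\ &= \frac{E_0\bigl[P_0[A_i \mid H]\, P_0[A_j \mid H]\bigr]}{P_0[A_j]}. \end{aligned} \tag{11c}$$
Proof. Immediate from (11a) and the definition of conditional expectation.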
For practical purposes, the representation of m ij in the first row of (11c) is more useful because most of the time no exact estimate of P 0 [A j | H] will be available. As a consequence there might be a non-zero difference between the values of the expectations in the first and second row of (11c) respectively. In contrast to the second row, for the derivation of the rhs of the first row of (11c), however, no use of the specific properties of conditional expectations has been made.
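The multi-class Scaled Probability Average estimator can be illustrated by a short Python sketch on a finite feature space. It assumes that the matrix entries are $m_{ij} = E_0[P_0[A_i \mid H] \mid A_j]$ and that the covariate shift estimates are $E_1[P_0[A_i \mid H]]$, as we read Proposition 1. The function names, the small Gaussian elimination routine and all numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def spa_estimate(priors0, f, g):
    """Return (covariate shift estimates, Scaled Probability Average estimates).

    priors0 : training prior class probabilities P0[A_i]
    f       : class-conditional feature distributions on a finite feature space
    g       : test set feature distribution
    """
    k, m = len(f), len(g)
    # training feature marginal and posterior class probabilities P0[A_i | H]
    h0 = [sum(priors0[i] * f[i][x] for i in range(k)) for x in range(m)]
    post = [[priors0[i] * f[i][x] / h0[x] for x in range(m)] for i in range(k)]
    # covariate shift estimates c_i = E1[P0[A_i | H]]
    c = [sum(g[x] * post[i][x] for x in range(m)) for i in range(k)]
    # m_ij = E0[P0[A_i | H] | A_j]
    mat = [[sum(f[j][x] * post[i][x] for x in range(m)) for j in range(k)]
           for i in range(k)]
    return c, solve_linear(mat, c)


# Hypothetical example: prior probability shift from (0.5, 0.3, 0.2) to (0.2, 0.3, 0.5).
f = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
priors0 = [0.5, 0.3, 0.2]
q_true = [0.2, 0.3, 0.5]
g = [sum(w * fj[x] for w, fj in zip(q_true, f)) for x in range(4)]
cov_shift, spa = spa_estimate(priors0, f, g)
```

In this constructed example the test set is an exact mixture of the training class-conditional distributions, so `spa` recovers `q_true`, whereas `cov_shift` stays closer to the training priors.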
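Corollary 1. In the setting of Proposition 1, suppose that $k = 2$. Define
$$R_0^2 = \frac{\mathrm{var}_0\bigl[P_0[A_1 \mid H]\bigr]}{P_0[A_1]\,(1 - P_0[A_1])} \in [0, 1]$$
with $\mathrm{var}_0$ denoting the variance under $P_0$. Then we have
$$E_1\bigl[P_0[A_1 \mid H]\bigr] = P_0[A_1]\,(1 - R_0^2) + q_1 R_0^2. \tag{12}$$
Proof. We start from the first element of the vector-equation (11b) and apply some algebraic manipulations. Write $\pi = P_0[A_1 \mid H]$ and $p = P_0[A_1]$, so that $E_0[\pi] = p$ and $E_0[\pi^2] = \mathrm{var}_0[\pi] + p^2$. Then
$$\begin{aligned} E_1[\pi] &= q_1 \frac{E_0[\pi^2]}{p} + (1 - q_1) \frac{E_0[\pi (1 - \pi)]}{1 - p} \\ &= q_1 \Bigl(p + \frac{\mathrm{var}_0[\pi]}{p}\Bigr) + (1 - q_1) \Bigl(p - \frac{\mathrm{var}_0[\pi]}{1 - p}\Bigr) \\ &= p + \mathrm{var}_0[\pi]\, \frac{q_1 - p}{p\,(1 - p)} \\ &= p\,(1 - R_0^2) + q_1 R_0^2. \end{aligned}$$
By (12), the difference between the covariate shift estimator and the true prior probability is the smaller the greater the discriminatory power (as measured by the generalised $R^2$) of the classifier is. Moreover, both (12) and (11b) provide closed-form solutions for $q_1, \ldots, q_k$ that transform the covariate shift estimates into correct estimates under the prior probability shift assumption. In the following the estimators defined this way are called Scaled Probability Average estimators.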
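For the binary case, the relationship between the covariate shift estimate and the Scaled Probability Average estimate can be illustrated numerically. The following sketch assumes the form $E_1[P_0[A_1 \mid H]] = P_0[A_1](1 - R_0^2) + q_1 R_0^2$ of (12), solved for $q_1$, on a three-point feature space. All variable names and numbers are hypothetical.

```python
# Hypothetical training set: class-conditional feature distributions and prior.
f1 = [0.6, 0.3, 0.1]           # features given class 1 (eg default)
f2 = [0.1, 0.3, 0.6]           # features given class 2
p0 = 0.1                       # training prior P0[A_1]

# Hypothetical test set generated under prior probability shift with q1 = 0.3.
q1_true = 0.3
g = [q1_true * a + (1.0 - q1_true) * b for a, b in zip(f1, f2)]

# Training feature marginal and posterior probability of class 1.
h0 = [p0 * a + (1.0 - p0) * b for a, b in zip(f1, f2)]
post = [p0 * a / h for a, h in zip(f1, h0)]

# Generalised R^2 of the classifier on the training set.
var0 = sum(h * pi * pi for h, pi in zip(h0, post)) - p0 * p0
r2 = var0 / (p0 * (1.0 - p0))

# Covariate shift estimate E1[P0[A_1 | H]] and its rescaled version.
cov_shift = sum(gi * pi for gi, pi in zip(g, post))
spa = p0 + (cov_shift - p0) / r2
```

With these numbers `r2` is roughly 0.19, so the covariate shift estimate moves only about a fifth of the way from the training prior 0.1 towards the test prior 0.3, while the rescaled estimate `spa` recovers 0.3.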
Corollary 1 on the relationship between covariate shift and Scaled Probability Average estimates in the binary classification case can be generalised to the relationship between covariate shift and KL distance estimates.
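Corollary 2. Under Assumption 2, consider the case $k = 2$. Let $X_1 = \frac{f_1}{f_2}$ and suppose that (8b) holds for $E = E_1$ such that a solution $p_1 \in (0, 1)$ of (6) exists. Then there is some $\alpha \in [0, 1]$ such that
$$E_1\bigl[P_0[A_1 \mid H]\bigr] = (1 - \alpha)\, P_0[A_1] + \alpha\, p_1.$$
Proof. Suppose that $g > 0$ is a density of $P_1$ with respect to some measure $\nu$ on $(\Omega, H)$. $\nu$ need not equal $\mu$ from Assumption 2, and we can choose $\nu = P_1$ and $g = 1$ if there is no other candidate. By Theorem 1 a) there are then $\nu$-densities $g_1 \ge 0$, $g_2 > 0$ such that $\frac{g_1}{g_2} = X_1$ and $g = p_1 g_1 + (1 - p_1)\, g_2$. We define a new probability measure $\tilde{P}_0$ on $(\Omega, A)$ by setting
$$\tilde{P}_0[A_1] = P_0[A_1], \qquad \tilde{P}_0[B \mid A_i] = \int_B g_i \, d\nu, \quad B \in H, \ i = 1, 2.$$
By construction of $\tilde{P}_0$ it holds that
$$P_1[B] = p_1\, \tilde{P}_0[B \mid A_1] + (1 - p_1)\, \tilde{P}_0[B \mid A_2], \quad B \in H.$$
Hence we may apply Corollary 1 to obtain
$$E_1\bigl[\tilde{P}_0[A_1 \mid H]\bigr] = \tilde{P}_0[A_1]\,(1 - \tilde{R}_0^2) + p_1 \tilde{R}_0^2,$$
where $\tilde{R}_0^2 \in [0, 1]$ is defined like $R_0^2$ with $P_0$ replaced by $\tilde{P}_0$. Observe that also by construction of $\tilde{P}_0$ we have
$$\tilde{P}_0[A_1 \mid H] = \frac{P_0[A_1]\, X_1}{P_0[A_1]\, X_1 + 1 - P_0[A_1]} = P_0[A_1 \mid H].$$
With the choice $\alpha = \tilde{R}_0^2$ this proves Corollary 2.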
How is the KL distance estimator (or ML estimator in case of ν being the empirical measure) of the prior class probabilities, defined by the solution of (6), in general related to the covariate shift and Scaled Probability Average estimators?
Suppose the test dataset differs from the training dataset by a prior probability shift with positive class probabilities, ie (11a) applies with $q_1, \ldots, q_k > 0$. Under Assumption 2 and a mild linear independence condition on the ratios of the densities $f_i$, Theorem 1 then implies that the KL distance and Scaled Probability Average estimators give the same results. Observe that in the context given by Assumption 2 the variables $X_i$ from Theorem 1 can be directly defined as $X_i = \frac{f_i}{f_k}$, $i = 1, \ldots, k-1$, or, equivalently, by
$$X_i = \frac{P_0[A_i \mid H]}{P_0[A_k \mid H]} \cdot \frac{P_0[A_k]}{P_0[A_i]}, \quad i = 1, \ldots, k-1. \tag{13}$$
Representation (13) of the density ratios might be preferable in particular if the classifier involved has been built by binary or multinomial logistic regression. In general, by Theorem 1 the result of applying the KL distance estimator to the test feature distribution $P_1$, in the quantification problem context described by Assumption 2, is a representation of $P_1$ as a mixture of distributions whose density ratios are the same as the density ratios of the class feature distributions $P_0[\,\cdot \mid A_i]$, $i = 1, \ldots, k-1$.
Hence the KL distance estimator makes sense under an assumption of identical density ratios in the training and test datasets. On the one hand this assumption is similar to the assumption of identical conditional class probabilities in the covariate shift assumption but does not depend in any way on the training set prior class probabilities. This is in contrast to the covariate shift assumption where implicitly a 'memory effect' with regard to the training set prior class probabilities is accepted.
On the other hand the 'identical density ratios' assumption is weaker than the 'identical densities' assumption (the former is implied by the latter) which is part of the prior probability shift assumption.
One possible description of 'identical density ratios' and the related KL distance estimator is that 'identical density ratios' generalises 'identical densities' in such a way that exact fit of the test set feature distribution is achieved (which by Theorem 1 is not always possible). It therefore is fair to say that 'identical density ratios' is closer to 'identical densities' than to 'identical conditional class probabilities'.
Given training data with full information (indicated by the σ-field $A$ in Assumption 2) and test data with information only on the features but not on the classes (σ-field $H$ in Assumption 2), it is not possible to decide whether the covariate shift or the identical density ratios assumption is more appropriate for the data. This is because both assumptions result in exact fit of the test set feature distribution $P_1|_H$ but in general give quite different estimates of the test set prior class probabilities (see Corollary 2 and Section 5). Only if Eq. (6) has no solution with positive components can it be said that 'identical density ratios' does not properly describe the test data because then there is no exact fit of the test set feature distribution. In that case 'covariate shift' might not be appropriate either but at least it delivers a mathematically consistent model of the data.
If both 'covariate shift' and 'identical density ratios' provide consistent models (ie exact fit of the test set feature distribution), non-mathematical considerations of causality (are features caused by class or is class caused by features?) may help in choosing the more suitable assumption. See [3] for a detailed discussion of this issue.

Cost quantification
'Cost quantification' is explained in [4] as follows: "The second form of the quantification task is for a common situation in business where a cost or value attribute is associated with each case. For example, a customer support log has a database field to record the amount of time spent to resolve each individual issue, or the total monetary cost of parts and labor used to fix the customer's problem. . . . The cost quantification task for machine learning: given a limited training set with class labels, induce a cost quantifier that takes an unlabeled test set as input and returns its best estimate of the total cost associated with each class. In other words, return the subtotal of cost values for each class." Careful reading of Section 4.2 of [4] reveals that the favourite solutions for cost quantification presented by the author essentially apply only to the case where the cost attributes are constant on the classes (see footnote 1).
Cost quantification can be treated more generally under Assumption 2 of this note. Denote by $C$ the (random) cost associated with an example. According to the description of cost quantification quoted above, $C$ is actually a feature of the example and may therefore be considered an $H$-measurable random variable under Assumption 2.
In mathematical terms, the objective of cost quantification is the estimation of the total expected cost per class, $E_1[C\, \mathbf{1}_{A_i}]$, $i = 1, \ldots, k$.
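Covariate-shift assumption. Under this assumption, since $C$ is $H$-measurable, we obtain
$$E_1[C\, \mathbf{1}_{A_i}] = E_1\bigl[C\, P_1[A_i \mid H]\bigr] = E_1\bigl[C\, P_0[A_i \mid H]\bigr], \quad i = 1, \ldots, k. \tag{14}$$
This gives a probability-weighted version of the 'Classify & Total' estimator of [4].
'Constant density ratios' assumption. Let $X_i = \frac{f_i}{f_k}$, $i = 1, \ldots, k-1$. If (6) (with $\mu = P_1$ and $g = 1$) has a solution $p_1 > 0, \ldots, p_{k-1} > 0$, $p_k = 1 - \sum_{j=1}^{k-1} p_j > 0$, then we can estimate the conditional class probabilities $P_1[A_i \mid H]$ by
$$P_1[A_i \mid H] = \frac{p_i X_i}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}, \ i = 1, \ldots, k-1, \qquad P_1[A_k \mid H] = \frac{p_k}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}.$$
From this, we obtain
$$E_1[C\, \mathbf{1}_{A_i}] = p_i\, E_1\Bigl[\frac{C\, X_i}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}\Bigr], \ i = 1, \ldots, k-1, \qquad E_1[C\, \mathbf{1}_{A_k}] = p_k\, E_1\Bigl[\frac{C}{1 + \sum_{j=1}^{k-1} p_j (X_j - 1)}\Bigr]. \tag{15}$$
Footnote 1: Only then do the $C^+$ terms used in Equations (4) and (5) of [4] stand for the same conditional expectation. The same observation applies to $C^-$.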
Obviously, the accuracy of the estimates on the rhs of both (14) and (15) strongly depends on the accuracy of the estimates of $P_0[A_i \mid H]$ and the density ratios on the training set. Accurate estimates of these quantities, in general, will make full use of the information in the σ-field $H$ (ie the information available at the time of estimation) and, because of the $H$-measurability of $C$, of the cost feature $C$. In order to achieve this, $C$ must be used as an explanatory variable when the relationship between the classes $A_i$ and the features as reflected in $H$ is estimated (eg by a regression approach). As one-dimensional densities are relatively easy to estimate it might make sense to deploy (14) and (15) with the choice $H = \sigma(C)$.
Note that this conclusion, at first glance, seems to contradict Section 5.3.1 of [4]. There it is recommended that "the cost attribute almost never be given as a predictive input feature to the classifier". Actually, with regard to the cost quantifiers suggested in [4], this recommendation is reasonable because the main component of the quantifiers as stated in (6) of [4] is correctly specified only if there is no dependence between the cost attribute $C$ and the classifier. Not using $C$ as an explanatory variable, however, does not necessarily imply that the dependence between $C$ and the classifier is weak. Indeed, if the classifier has any predictive power and $C$ is on average different on the different classes of examples then there must be a non-zero correlation between the cost attribute $C$ and the output of the classifier.
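The two cost quantifiers can be compared in a small Python sketch for $k = 2$ on a finite feature space. It assumes that the covariate shift quantifier is $E_1[C\, P_0[A_1 \mid H]]$ and that the 'constant density ratios' quantifier uses $P_1[A_1 \mid H] = p_1 X_1 / (1 + p_1 (X_1 - 1))$ with $p_1$ solving (6), which is our reading of (14) and (15). The bisection helper `ml_weight`, the cost values and all other numbers are hypothetical.

```python
def ml_weight(g, x, tol=1e-14):
    """Bisection for the root p in (0, 1) of sum g*(x-1)/(1+p*(x-1)) (decreasing in p)."""
    def G(p):
        return sum(gi * (xi - 1.0) / (1.0 + p * (xi - 1.0)) for gi, xi in zip(g, x))

    lo, hi = 0.0, 1.0
    if not (G(lo) > 0.0 > G(hi)):
        raise ValueError("no solution in (0, 1)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical training information and test feature distribution.
f1 = [0.6, 0.3, 0.1]               # features given class 1
f2 = [0.1, 0.3, 0.6]               # features given class 2
p0 = 0.1                           # training prior of class 1
g = [0.3 * a + 0.7 * b for a, b in zip(f1, f2)]   # test set: class 1 prevalence 0.3
cost = [100.0, 50.0, 10.0]         # cost C as a function of the feature value

total_cost = sum(gi * c for gi, c in zip(g, cost))

# Covariate shift quantifier: weight costs by the training posterior P0[A_1 | H].
post0 = [p0 * a / (p0 * a + (1.0 - p0) * b) for a, b in zip(f1, f2)]
cov_cost1 = sum(gi * c * pi for gi, c, pi in zip(g, cost, post0))

# Constant density ratios quantifier: weight costs by the re-estimated posterior.
x = [a / b for a, b in zip(f1, f2)]
p1 = ml_weight(g, x)
post1 = [p1 * xi / (1.0 + p1 * (xi - 1.0)) for xi in x]
ml_cost1 = sum(gi * c * pi for gi, c, pi in zip(g, cost, post1))
ml_cost2 = sum(gi * c * (1.0 - pi) for gi, c, pi in zip(g, cost, post1))
```

Both quantifiers split the same total expected cost between the classes, but in this constructed example the covariate shift quantifier attributes markedly less cost to class 1 because it retains the training prior of 0.1.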

Loss rates estimation with mixture model methods
Theorem 1 and the results of Section 3 have obvious applications to the problem of forecasting portfolio-wide default rates in portfolios of rated or scored borrowers. The forecast portfolio-wide default rate may be interpreted in an individual sense as a single borrower's unconditional probability of default. But there is also an interpretation in a collective sense as the forecast total proportion of defaulting borrowers. The statements of Theorem 1 and Assumption 2 are agnostic in the sense of not suggesting an individual or collective interpretation of the models under inspection. But by explaining Assumption 2 in terms of a classifier and the examples to which it is applied we have suggested an individual interpretation of the assumption.
However, there is no need to adopt this perspective on Assumption 2 and the results of Section 3. Instead of interpreting P 0 [A 1 ] as an individual example's probability of belonging to class 1 we could as well describe P 0 [A 1 ] as the proportion of a mass or substance that has property 1. If we do so we switch from an interpretation of probability spaces in terms of likelihoods associated with individuals to an interpretation in terms of proportions of parts of masses or substances.
Let us look at a retail mortgage portfolio as an illustrative example. Suppose that each mortgage has a loan-to-value (LTV) associated with it which indicates how well the mortgage loan is secured by the pledged property. Mortgage providers typically report their exposures and losses in tables that provide this information per LTV band without specifying numbers or percentages of borrowers involved. Table 1 shows a stylised example of what such a report might look like.
This portfolio description fits well into the framework described by Assumption 2. Choose events $H_1$ = 'More than 100% LTV', $H_2$ = 'Between 90% and 100% LTV' and so on. Then the σ-field $H$ is generated by the finite partition $H_1, \ldots, H_5$. Similarly, choose $A_1$ = 'lost' and $A_2$ = 'not lost'. The measure $P_0$ describes last year's observations, while $P_1$ specifies the distribution of the exposure over the LTV bands as observed at the beginning of this year, which is the forecast period. We can then try to replace the question marks in Table 1 by deploying the estimators discussed in Section 3. Table 2 shows the results.
Clearly, the estimates under the prior probability shift assumptions are much more sensitive to changes of the features (ie LTV bands) distribution than the estimate under the covariate shift assumption. Thus the theoretical results of Corollaries 1 and 2 are confirmed. But recall that there is no right or wrong here as all the numbers in Table 1 are purely fictitious. Nonetheless, we could conclude that in applications with unclear causalities (like for credit risk measurement) it might make sense to compute both covariate shift estimates and ML estimates (more suitable under a prior probability shift assumption) in order to gauge the possible range of outcomes.

Conclusions
We have revisited the maximum likelihood estimator (or more generally Kullback-Leibler (KL) distance estimator) of the component weights in simple finite mixture models. We have found that (if all weights of the estimate are positive) it enjoys an exact fit property which makes it even more attractive with regard to mathematical consistency. We have suggested a Gauss-Seidel-type approach to the calculation of the KL distance estimator that triggers an alarm if there is no solution with all components positive (which would indicate that the number of modelled classes may be reduced).
In the context of two-class quantification problems, as a consequence of the exact fit property we have shown theoretically and by example that the straight-forward 'covariate shift' estimator of the prior class probabilities may seriously underestimate the change of the prior probabilities if the covariate shift assumption is wrong and instead a prior probability shift has occurred. This underestimation can be corrected by the Scaled Probability Average approach which we have generalised to the multi-class case or the KL distance estimator.
As an application example, we then have discussed cost quantification, ie the attribution of total cost to classes on the basis of characterising features when class membership is unknown. In addition, we have illustrated by example that the mixture model approach to quantification is not restricted to the forecast of prior probabilities but can also be deployed for forecasting loss rates.