1. Introduction
In machine learning terminology, dataset shift refers to the phenomenon that the joint distribution of features and labels on the training dataset used for learning a model may differ from the related joint distribution on the test dataset to which the model is going to be applied; see Storkey [1] or Moreno-Torres et al. [2] for surveys and background information on dataset shift. Dataset shift can be the consequence of very different causes. For that reason, a catch-all treatment of general dataset shift is difficult if not impossible. As a workaround, a number of specific types of dataset shift have been defined in order to introduce additional assumptions that allow for different tailor-made approaches to deal with the problem. The most familiar subtypes of dataset shift are prior probability shift and covariate shift, but new types continue to be introduced as practice-driven needs arise.
Typically, under dataset shift, the test dataset observations of features are available, but the class labels cannot be observed. In this situation, it is impossible to know ex ante whether covariate shift or prior probability shift (or something in between) has occurred. However, models estimated under the assumptions of covariate shift and prior probability shift, respectively, tend to differ conspicuously. As a consequence, additional assumptions need to be made in order to be able to choose between modelling options related to covariate shift and prior probability shift. Such additional assumptions may be phrased in terms of causality (Storkey [1]): if the features can be considered “causing” the class labels, then models designed to deal with covariate shift are appropriate. Otherwise, if the class labels “cause” the features, models targeting prior probability shift should be preferred.
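To fix ideas, the two shift types can be contrasted by means of the familiar density factorisations of the joint distribution of features x and labels y; here p denotes source densities and q target densities, a notation used for illustration only:

$$\text{covariate shift:}\qquad q(x,y) \;=\; q(x)\,p(y\mid x), \qquad\text{i.e., the posteriors } p(y\mid x) \text{ are unchanged,}$$

$$\text{prior probability shift:}\qquad q(x,y) \;=\; q(y)\,p(x\mid y), \qquad\text{i.e., the class-conditional feature densities } p(x\mid y) \text{ are unchanged.}$$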
He et al. [3] recently proposed “factorizable joint shift” (FJS), which generalises both prior probability shift and covariate shift. They went on to present the “joint importance aligning” method for estimating the characteristics of this type of shift. At first glance, He et al. hence seemed to provide a way to avoid choosing ex ante between covariate shift and prior probability shift models. Instead, “joint importance aligning” (plus some regularisation) appeared to be a method that functioned as a covariate shift model, prior probability shift model, or combined covariate and label shift model, as required by the characteristics of the test dataset.
By a detailed analysis of factorizable joint shift in multinomial classification settings, we point out in this paper that general factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. This is in contrast to the situations with covariate shift or prior probability shift. Therefore, circumspection is recommended with regard to potential deployment of “joint importance aligning” as proposed by He et al. [3].
He et al. characterised factorizable joint shift by claiming that “the biases coming from the data and the label are statistically independent”. This description might not fully hit the mark. As we demonstrate in this paper, factorizable joint shift has little to do with statistical independence but should rather be interpreted as a structural property similar to the “separation of variables” which plays an important role in finding closed-form solutions to differential equations. We also argue that, in probabilistic terms, factorizable joint shift is perhaps better described as “scaled density ratios” shift.
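In the same illustrative density notation, the defining property of factorizable joint shift is that the ratio of the target to the source joint density separates into a factor depending only on the features and a factor depending only on the class label,

$$\frac{q(x,y)}{p(x,y)} \;=\; g(x)\,b(y),$$

which is precisely the separation-of-variables structure alluded to above. Covariate shift (b constant) and prior probability shift (g constant) appear as the two extreme special cases.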
The plan of this paper and its main research contributions are as follows:
Section 2 “Setting the scene” presents the assumptions, concepts and notation for the multinomial (or multiclass) classification setting of this paper.
Section 3 “General dataset shift in multinomial classification” introduces a normal form for the joint density of features and class labels (Theorem 1) and derives in Corollary 2 a generalisation of the correction formula for class posterior probabilities of Saerens et al. [4] and Elkan [5].
Section 4 “Factorizable joint shift” defines this kind of dataset shift in a mathematically rigorous manner and presents a full representation in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features (Theorem 2). In addition, a specific version of the posterior correction formula is given (Corollary 4), and the description of factorizable joint shift as “scaled density ratios” shift is motivated. Moreover, alternatives to the “joint importance aligning” of He et al. [3] are proposed (Section 4.1).
Section 5 “Common types of dataset shift” examines, in a mathematically rigorous manner, whether a number of types of dataset shift mentioned in the literature imply or are implied by factorizable joint shift. The types of dataset shift treated in this section are prior probability shift, covariate shift, covariate shift with posterior drift, domain invariance and generalised label shift. In addition, the posterior correction formulae specific to these types of dataset shift are presented.
Section 6 “Sample selection bias” revisits the topic of dataset shift caused by sample selection bias and looks at the question of what the class-wise selection probabilities look like if the induced dataset shift is factorizable joint shift (Theorem 3).
Section 7 “Conclusions” provides a short discussion of the important findings of the paper and points to some open research questions.
2. Setting the Scene
In this paper, we use the following population-level description of the multinomial classification problem under dataset shift in terms of measure theory. See standard textbooks on probability theory like Billingsley [6] or Klenke [7] for formal definitions and background of the notions introduced in Assumption 1. See Tasche [8] for a detailed reconciliation of the setting of this paper with the concepts and notation used in the mainstream machine learning literature.
Assumption 1. is a measurable space. The source distribution P and the target distribution Q are probability measures on . For some positive integer , events and a sub-σ-algebra are given. The events , , and have the following properties:
- (i)
.
- (ii)
, , .
- (iii)
, .
- (iv)
, .
- (v)
, .
In the literature, P is also called “source domain” or “training distribution” while Q is also referred to as “target domain” or “test distribution”.
The elements of the underlying space are objects (or instances) with class (label) and covariate (or feature) attributes. An object belongs to class i if it is an element of the corresponding class event (or to the positive class in the binary case).
The σ-algebra of events is a collection of subsets F of the underlying space with the property that they can be assigned probabilities P(F) and Q(F) in a logically consistent way. In the literature, thanks to their role of reflecting the available information, σ-algebras are sometimes also called “information sets” (Holzmann and Eulert [9]). In the following, we use both terms interchangeably.
The sub-σ-algebra generated by the covariates (features) contains the events which are observable at the time when the class of an object has to be predicted. Since the class events themselves are not contained in this sub-σ-algebra, the class of the object may not yet be known at that time. In this paper, we assume that under the source distribution P, the class events can be observed such that the prior class probabilities can be estimated. In contrast, under the target distribution Q, the class events cannot be directly observed and can only be predicted on the basis of the feature events, which are assumed to reflect the features of the object.
For technical reasons, it is convenient to define the joint information set of features and class labels:
Definition 1. The σ-algebra generated by the class events is the minimal sub-σ-algebra of the underlying σ-algebra containing all class events; the joint information set of features and class labels is the minimal sub-σ-algebra containing both this σ-algebra and the feature information set.
Note that the σ-algebra generated by the class events can be represented explicitly as the collection of all unions of class events, while the joint information set of Definition 1 can be written as in (1b).
A standard assumption in machine learning is that source and target distribution are the same, i.e., P = Q. The situation where P(F) ≠ Q(F) holds for at least one event F is called dataset shift (Moreno-Torres et al. [2], Definition 1).
Under dataset shift as defined this way, typically, classifiers or posterior class probabilities learnt under the source distribution stop working properly under the target distribution. Finding algorithms to deal with this problem is one of the tasks in the field of domain adaptation.
In this paper, we are mostly interested in exploring how posterior class probabilities change between a source and a target distribution as described in Assumption 1. In particular, we provide generalisations of the posterior correction formula (2.4) of Saerens et al. [4] (see also Theorem 2 of Elkan [5]). For this purpose, the notions of conditional expectation and conditional probability are crucial.
In the following, expectations, conditional or unconditional, are taken with respect to the probability measure indicated, for instance the source distribution P. For a given probability space, we refer to Section 8.2 of Klenke [7] for the formal definitions and properties of conditional expectations and conditional probabilities.
In the machine learning literature, the term posterior class probability rather than conditional probability is often used to refer to the feature-conditional probabilities of the class events under P and Q in the context of Assumption 1. In contrast, the term prior probability is used for the class probabilities under P and Q, which in our measure-theoretic setting should rather be called unconditional probabilities of the class events.
An assumption of absolute continuity is also crucial for an investigation of how the posterior class probabilities are impacted by a change from the source distribution to the target distribution. Formally, this assumption reads as follows:
Assumption 2. Assumption 1 holds, and Q is absolutely continuous with respect to P on the joint information set of features and class labels, where the restriction of a measure M to a sub-σ-algebra stands for the measure M with domain restricted to that sub-σ-algebra.
The statement “Q is absolutely continuous with respect to P” on an information set means that for all events F in that information set, P(F) = 0 implies Q(F) = 0. Hence, “impossible” events under P are also impossible under Q. Measure-theoretic impossibility is somewhat unintuitive because for continuous distributions each single outcome has probability 0 and therefore is impossible. Nonetheless, sampled values from such distributions are single outcomes and occur despite having probability 0.
However, the statement “for all events F, P(F) = 0 implies Q(F) = 0” is equivalent to saying: for all events F, Q(F) > 0 implies P(F) > 0. This means that “possible” events under Q are also possible events under P, even if with very tiny probabilities of occurrence. This phrasing of absolute continuity is more intuitive and is preferred by some authors, for instance by He et al. [3], who in their Section 2 make an assumption of this kind, which they seem to understand in the sense of Assumption 2.
As mentioned before, even if the target distribution Q is absolutely continuous with respect to P, there may be events whose probabilities under Q are much greater than their probabilities under P. From a practical point of view, such events may even appear to be “impossible” under P. Notions such as “sufficient support” and “support sufficiency divergence” (Johansson et al. [10]) suggest that this is the view of the machine learning community. Hence, Assumption 2 is not necessarily in contrast to the working assumption of partially or fully nonoverlapping source and target domains made by many researchers in unsupervised domain adaptation.
For analyses of the case of domains where the source does not completely cover the target (such that Assumption 2 may be violated), see Johansson et al. [10]. However, the statement of Johansson et al., Section 5, “If this overlap is increased without losing information, such as through collection of additional samples, this is usually preferable.”, suggests that an assumption of nonoverlapping support is not the same as an assumption of a lack of absolute continuity. Indeed, according to the statement by Johansson et al., events outside of the source support do not appear to be impossible, because otherwise the “collection of additional samples” could not increase the support overlap between source and target.
Assumption 2 is stronger than the common assumption of absolute continuity on the feature information set only (see, for instance, Scott [11]), but in terms of interpretation there is no big difference: all events possible under the target distribution (including events in label space) are also possible under the source distribution.
An important consequence of Assumption 2 is that we can use the source distribution P as a reference measure for the target distribution Q. This is more natural than introducing another measure without real-world meaning as a reference for both P and Q. In addition, dispensing with a separate reference measure has the advantageous effect of simplifying the notation.
Recall the following common conventions intended to make the measure-theoretic notation more incisive:
Notation 1. An important consequence of deploying a measure-theoretic framework as in this paper is that real-valued random variables X on a fixed probability space are uniquely defined only up to events of probability 0 and may be undefined or ill-defined on such events or when they are multiplied by a factor of 0. To be more specific:
If another random variable coincides with X up to an event of probability 0, then the expectation of X under P exists if and only if the expectation of the other random variable exists. In this case, the two expectations coincide.
If X is undefined or ill-defined on an event N with P(N) = 0, then, by definition, the expectation of X under P exists if and only if the expectation of X restricted to the complement of N exists. In this case, the expectation of X is defined as this latter expectation.
If X is undefined or ill-defined on an event F but is multiplied by another random variable Z which takes the value 0 on F, then, by definition, the expectation of XZ under P exists if and only if the expectation of XZ restricted to the complement of F exists. In this case, the expectation of XZ is defined as this latter expectation.
The conventions listed in Notation 1 are convenient and are used frequently in the following text. Note, however, that they are only valid in the context of a fixed probability measure P. For instance, under Assumption 2, if the event N on which the random variable X is undefined has probability 0 under the source distribution P, then it has probability 0 under Q as well, such that the expectation of X under Q should be well-defined. Nonetheless, probability 0 under Q does not necessarily imply probability 0 under P, such that the expectation of X under Q might be well-defined despite the expectation under P being ill-defined.
In the same vein, under Assumption 2, the expectations under Q of the source posterior class probabilities are well-defined. However, the expectations under P of the target posterior class probabilities are potentially ill-defined because there could be versions of the target posterior class probabilities which are indistinguishable under Q but different with positive probability under P. In the following, we are careful to avoid such issues whenever the discussion involves more than one probability measure.
3. General Dataset Shift in Multinomial Classification
Under Assumption 2, by the Radon–Nikodym theorem, there is a density, measurable with respect to the joint information set defined by (1b), of the target distribution Q with respect to the source distribution P. This density links Q to P by Equation (2), which states that the Q-probability of any event in the joint information set equals the P-expectation of the product of the event's indicator function and the density. In (2) and in the remainder of the paper, the indicator function of an event F is the random variable which takes the value 1 on F and the value 0 on the complement of F.
Unfortunately, in practice this joint density is more or less unobservable. Therefore, it is desirable to decompose it into smaller parts which may be observable or can perhaps be determined through reasonable assumptions. The key step towards such a decomposition is made with the following combination of definitions and lemma.
Definition 2. Under Assumption 1, define the class-conditional distributions under P and under Q by conditioning the respective measure on each of the class events. In the literature, when restricted to the feature information set, these class-conditional distributions are sometimes called class-conditional feature distributions.
Lemma 1. Under Assumption 2, for each class i, the target class-conditional feature distribution is absolutely continuous with respect to the source class-conditional feature distribution on the feature information set. Fix a Radon–Nikodym derivative (or density) of the target class-conditional feature distribution with respect to its source counterpart. If there is another feature-measurable function with the density property, then the two functions are related as stated in (4).
Proof. Fix i and choose any feature event that is a null set under the source class-conditional feature distribution of class i. Then, the intersection of this event with the class event is a null set under P. By Assumption 2, it is a null set under Q as well, which implies that the chosen event is also a null set under the target class-conditional feature distribution of class i. Hence, the asserted absolute continuity follows, and with it the existence of a density. The uniqueness of Radon–Nikodym derivatives implies the right-hand side of (4). However, by the definition of conditional probability, it also follows that the left-hand side of (4) holds. □
With Lemma 1 as preparation, we are in a position to state the following key representation result and some corollaries for the joint density of features and class labels. In the remainder of this paper, we make use of (5) as a normal form for this joint density.
Theorem 1. Under Assumption 2, the density of Q with respect to P on the joint information set can be represented as in (5), where the class-wise factors in (5) are any densities of the target class-conditional feature distributions with respect to their source counterparts on the feature information set, as introduced in Lemma 1, for each class.
Proof. Let F be an event in the joint information set. By (1b), F can then be written in terms of the class events and feature events, and the probability Q(F) decomposes accordingly. Equation (5) follows from this by the definition of Radon–Nikodym derivatives. □
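For orientation, the normal form (5) has the following shape when spelled out in generic notation, with $A_1,\dots,A_k$ standing for the class events, $f_i$ for a density of the target class-conditional feature distribution of class i with respect to its source counterpart (Lemma 1), and $\mathbf{1}_{A_i}$ for the indicator of $A_i$; these symbols are stand-ins chosen here for illustration:

$$\frac{\mathrm{d}Q}{\mathrm{d}P} \;=\; \sum_{i=1}^{k} \mathbf{1}_{A_i}\,\frac{Q(A_i)}{P(A_i)}\,f_i\,.$$

Each summand combines a prior probability ratio with a class-conditional feature density ratio, which is what makes (5) convenient as a normal form.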
Corollary 1. Under Assumption 2, the density h of Q with respect to P on the feature information set can be written in the corresponding mixture form obtained from (5).
Proof. The corollary follows from Theorem 1 because h is the conditional expectation under P, given the feature information set, of the density of Theorem 1. □
Corollary 2. Under Assumption 2, the conditional probability (posterior class probability) of each class under Q can be represented as in (6) on the set where h is positive, where h denotes the denominator of the right-hand side of (6) (and the density of Q with respect to P on the feature information set, as introduced in Corollary 1).
Equation (6) generalises Equation (2.4) of Saerens et al. [4] and Theorem 2 of Elkan [5] from prior probability shift to general dataset shift. Saerens et al. commented on their Equation (2.4) as follows: “This well-known formula can be used to compute the corrected a posteriori probabilities, …”. Hence, in this paper we call (6) the posterior correction formula.
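In the same generic stand-in notation, additionally writing $\mathcal{H}$ for the feature information set so that $P(A_i\mid\mathcal{H})$ denotes the source posterior probability of class i, the posterior correction formula takes the shape

$$Q(A_i\mid\mathcal{H}) \;=\; \frac{\dfrac{Q(A_i)}{P(A_i)}\,f_i\,P(A_i\mid\mathcal{H})}{\sum_{j=1}^{k}\dfrac{Q(A_j)}{P(A_j)}\,f_j\,P(A_j\mid\mathcal{H})}\,,$$

valid on the set where the denominator h is positive. With all $f_i \equiv 1$, i.e., no change of the class-conditional feature distributions, this reduces to Equation (2.4) of Saerens et al. [4].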
Recall that under Assumption 2, the event where h vanishes has probability 0 under Q, while it may have positive probability under P. Hence, the target posterior class probabilities are fully specified by (6) under Q but possibly only incompletely specified under P.
Proof of Corollary 2. Apply the generalised Bayes formula (see Lemma A1 in Appendix A) with suitably chosen arguments. □
A direct application of the posterior correction formula (6) is not possible in general because the target prior probabilities and the target class-conditional feature densities typically are unknown. However, in some cases the target priors might be known from external sources such as central banks, the IMF or national statistical offices. Under more specific assumptions on the type of dataset shift, it may be possible to estimate the target priors from the target dataset. See González et al. [12] for a survey of estimation methods under the assumption of prior probability shift.
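As a purely numerical illustration of how the posterior correction formula would be applied once its ingredients have been estimated or fixed by assumption, consider the following sketch. All function and variable names as well as the numbers are hypothetical choices made for this example only.

```python
import numpy as np

def correct_posteriors(source_post, p_prior, q_prior, density_ratios):
    """Posterior correction in the spirit of Corollary 2.

    source_post    : source posterior probabilities of the classes at a feature point
    p_prior        : source prior class probabilities
    q_prior        : target prior class probabilities
    density_ratios : class-conditional feature density ratios (target vs. source)
                     evaluated at the same feature point; all equal to 1 under
                     prior probability shift
    Returns the corrected (target) posterior probabilities.
    """
    weights = (q_prior / p_prior) * density_ratios * source_post
    return weights / weights.sum()  # normalise by the denominator h

# Illustrative three-class example:
source_post = np.array([0.7, 0.2, 0.1])   # classifier output under the source distribution
p_prior = np.array([0.5, 0.3, 0.2])       # source priors
q_prior = np.array([0.2, 0.3, 0.5])       # target priors, e.g., from an external source
f_ratios = np.array([1.0, 1.0, 1.0])      # prior probability shift special case

print(correct_posteriors(source_post, p_prior, q_prior, f_ratios))
```

With all density ratios equal to 1, the computation reproduces the correction of Saerens et al. [4].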
Under prior probability shift, the class-conditional feature distributions are assumed to be the same under the source and the target distribution for all i (see Section 5.1 below), i.e., there is no change of the conditional feature distributions. This assumption might be too strong in some situations. It might be more promising to assume similar changes for all classes (i.e., class-conditional feature density ratios that are similar across classes), for instance, by assuming factorizable joint shift (see Section 4 below), or by trying to find transformations (or representations) of the features that make the resulting feature densities similar (see Section 5.4 and Section 5.5 below).
For the sake of completeness, we also mention the following alternative representation (7b) of the joint density. Compared to (7b), (5) provides more structural information, in particular when taking into account Corollary 2 above, and therefore is potentially more useful.
Corollary 3. Under Assumption 2, let h be a density of Q with respect to P on the feature information set. Then, the target posterior class probabilities vanish on the event where the corresponding source posterior class probabilities vanish, as stated in (7a). Moreover, the density of Q with respect to P on the joint information set can be represented as in (7b).
Proof. Equation (7a) follows immediately from Corollary 2. Taking into account Notation 1 for the meaning of (7b) on the event where the source posterior class probabilities vanish, the equation follows from (1b) and the definition of the posterior class probabilities. □
The following result may be considered an inversion of the previous results, in particular of Corollary 2, on the relationship between source and target distributions. It is of interest mostly for dealing with sample selection bias (see Section 6 below).
Proposition 1. In the setting of Theorem 1, assume additionally that the density of Q with respect to P on the joint information set is positive. Then, the following statements hold true:
- (i) P is absolutely continuous with respect to Q on the joint information set, with the reciprocal of that density as density.
- (ii) For each class, the source class-conditional feature distribution is absolutely continuous with respect to its target counterpart on the feature information set, with density given by the reciprocal of the corresponding density of Lemma 1.
- (iii) The density of P with respect to Q on the joint information set can also be represented in the normal form of Theorem 1 with the roles of P and Q swapped.
- (iv) The density of P with respect to Q on the feature information set can be represented as in Corollary 1 with the roles of P and Q swapped.
- (v) For each class, the source posterior class probabilities can be represented by the posterior correction formula with the roles of P and Q swapped.
Proof. (i) is a well-known property of equivalent probability measures (see Problem 32.6 of Billingsley [6]). By (i), P is absolutely continuous with respect to Q on the joint information set. This implies that each source class-conditional feature distribution is absolutely continuous with respect to its target counterpart on the feature information set and, again by Problem 32.6 of [6], the rest of (ii) follows as well. Properties (iii), (iv) and (v) follow from (i) and (ii), by making use of Theorem 1 and Corollaries 1 and 2 with swapped roles of P and Q. □
6. Sample Selection Bias
Sample selection bias is an important cause of dataset shift. In this section, we revisit parts of Hein [23] in order to illustrate some of the concepts and results presented before. We basically work under Assumption 1, but without the interpretation of P as source and Q as target distribution. Instead, P is interpreted as the distribution of a population from which a potentially biased random sample is taken, resulting in the distribution Q. When studying sample selection bias in this setting, the goal is to infer properties of P from properties of the sample distribution Q.
The following assumption describes the setting of this section. The idea is that under the population distribution, each object has a positive chance to be selected. This chance may depend upon the features (covariates) and the class of the object.
Assumption 3 (Sample selection). is a measurable space. The population distribution P is a probability measure on . For some positive integer , events and a sub-σ-algebra are given. The events , and have the following properties:
- (i)
.
- (ii)
, , .
- (iii)
, .
- (iv)
, .
The selection probability is a random variable with positive values, measurable with respect to the sub-σ-algebra of features and class labels defined as in (1b). The probability space also supports a random variable U which is uniformly distributed on the unit interval and such that U and this sub-σ-algebra are independent.
Definition 4 (Sample distribution). Under Assumption 3, define the event S of being selected as the event that U does not exceed the selection probability. The probability measure Q, defined by conditioning P on the event of being selected, is called the sample distribution.
Note that the measure Q is well-defined because, from the independence of U and the joint information set, it follows that the probability of the event of being selected equals the expectation of the selection probability under P and hence is positive. Another consequence of the independence of U and the joint information set is that the conditional probability of being selected, given features and class labels, equals the selection probability itself.
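The sample selection mechanism of Assumption 3 and Definition 4 is easy to emulate numerically. The following sketch is purely illustrative: two classes, one-dimensional Gaussian class-conditional features and a hand-picked selection probability depending on both features and class are assumptions of this example, not part of the paper's formal development.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Population distribution P: two classes with priors 0.6 / 0.4 and
# Gaussian class-conditional feature distributions (illustrative choice).
y = rng.binomial(1, 0.4, size=n)                       # class labels 0/1
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0, size=n)

# Selection probability with values in (0, 1], depending on features and class
# (hypothetical functional form).
phi = 0.2 + 0.6 / (1.0 + np.exp(-(x + 0.5 * y)))

# One concrete realisation of the selection event: U uniform, independent of (x, y).
u = rng.uniform(size=n)
selected = u <= phi

# Sample distribution Q = P( . | selected): priors and feature distribution shift.
print("population prior of class 1:", y.mean())
print("sample prior of class 1:    ", y[selected].mean())
print("population feature mean:    ", x.mean())
print("sample feature mean:        ", x[selected].mean())
print("P(S) vs E_P[phi]:           ", selected.mean(), phi.mean())
```

The last line illustrates that the probability of being selected equals the expectation of the selection probability, as noted above.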
Proposition 5. P and Q as described in Assumption 3 and Definition 4 satisfy Assumptions 1 and 2 with P as source distribution and Q as target distribution. Moreover, P is absolutely continuous with respect to Q on the joint information set.
Proof. It remains to show that
Q is absolutely continuous with respect to P on the joint information set, with the density given below;
P is absolutely continuous with respect to Q on the joint information set;
the prior class probabilities are positive under Q.
By definition of Q as P conditional on S, the sample distribution Q is absolutely continuous with respect to P on the full σ-algebra and hence also on the joint information set. For the density, we obtain the selection probability divided by the probability of being selected. The fact that the selection probability is positive implies that P is absolutely continuous with respect to Q on the joint information set. Since the prior class probabilities are positive under P, the absolute continuity of P with respect to Q implies that they are positive under Q as well. □
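In informal notation, writing φ for the selection probability and S for the event of being selected (the symbols merely paraphrase Assumption 3 and Definition 4), the density obtained in the proof and its normalising constant read

$$\frac{\mathrm{d}Q}{\mathrm{d}P}\;=\;\frac{\varphi}{P(S)}, \qquad P(S)\;=\;E_P(\varphi)\;>\;0\,.$$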
6.1. Properties of the Sample Selection Model
Equation (25) implies a corresponding representation of the density h of Q with respect to P on the feature information set: h equals the feature-conditional expectation of the selection probability under P, divided by the probability of being selected. From representation (1b) of the joint information set, an alternative description of this conditional expectation follows, namely as a mixture, over the classes, of the class-wise feature-conditional selection probabilities, weighted with the source posterior class probabilities; here the class-conditional feature distributions under P are those of Definition 2. The class-wise selection probability of class i is accordingly the feature-conditional probability of being selected on the subpopulation of objects with class i. For each class, a short calculation then yields the relation (28) between these class-wise selection probabilities and the quantities appearing in Theorem 1.
Equation (28) and Theorem 1 together imply an alternative representation of the joint density of Q with respect to P. By the generalised Bayes formula (Lemma A1 in Appendix A), (25) implies the representation (30) of the posterior class probabilities under Q.
Zadrozny [24] and Hein [23] observed that if the event S of being selected and the class labels, as expressed by the σ-algebra generated by the class events, were independent conditional on the information set reflecting the features, then the population distribution P and the sample distribution Q were related by covariate shift. A consequence of (30) is that the converse of this observation actually also holds true, as stated in the following proposition.
Proposition 6. In the sample selection model, as specified by Assumption 3 and Definition 4, the population distribution P and the sample distribution Q are related by covariate shift if and only if the corresponding conditional-independence condition holds, i.e., if the event of being selected and the class labels are independent conditional on the features under the population distribution P.
Proof. Proposition 6 is obvious from (30) and the definition of covariate shift (17). □
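In informal density notation, with s ∈ {0, 1} indicating selection, x the features and y the class label (symbols used for illustration only), the condition of Proposition 6 states that the feature-conditional selection probability does not depend on the class:

$$p(s=1\mid x, y) \;=\; p(s=1\mid x)\qquad\text{for all classes } y,$$

or, equivalently, that the class-wise feature-conditional selection probabilities of Section 6.1 coincide across the classes.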
In the case of general dataset shift caused by sample selection, Equation (30) does not provide information about how to compute the population posterior class probabilities from the sample posterior class probabilities. Translated into the setting of this paper, Hein [23] presented in Equation (3.2) the following two ways to do so:
Define the distribution of the not-selected sample by conditioning P on the complement of the event of being selected. Then, the representation (31a) of the population posterior class probabilities holds for each class.
Equation (30) can be written equivalently in a rearranged form. Hence, on the appropriate event, the representation (31b) of the population posterior class probabilities is obtained for each class.
Both (31a) and (31b) are of limited practical usefulness, however: on the one hand, (31a) requires knowledge of the class labels in the not-selected sample, which usually are not available. On the other hand, for (31b) to be applicable, the class-wise probabilities of selection must be estimated, which again requires knowledge of the class labels in the not-selected sample.
6.2. Sample Selection Bias and Factorizable Joint Shift
Proposition 6 provides an example of a condition for the sample selection process that makes the resulting bias between population and sample representable as covariate shift and, consequently, according to Section 5.2, as a special case of factorizable joint shift. Are there other selection procedures that entail factorizable joint shift?
We investigate this question by assuming that the population distribution P and the sample distribution Q are related by factorizable joint shift and then identifying the consequences this assumption implies for the class-wise feature-conditional selection probabilities introduced in Section 6.1.
Theorem 3. Under Assumption 3 and Definition 4, let P and Q be related by factorizable joint shift in the sense of Definition 3, i.e., there are a feature-measurable function g and a label-measurable function b such that the density of Q with respect to P on the joint information set can be represented as the product of g and b. Then, the following statements hold true:
- (i) Q and P are related by factorizable joint shift, with a feature-measurable function and a label-measurable function that can be represented up to a constant factor in the sense of (8b) as in (32a), where the constants satisfy the equation system (32b).
- (ii) The population posterior probabilities can be represented as functions of the sample posterior probabilities, where the constants involved satisfy equation system (32b).
- (iii) The class-wise feature-conditional selection probabilities can be represented as in (34), where the constants satisfy equation system (32b).
Proof. The functions g and b must be positive since the density of Q with respect to P is positive according to Proposition 5. Hence, Q and P are related by factorizable joint shift with the decomposition given by the reciprocals of g and b. Apply Theorem 2 with swapped roles of P and Q to obtain representation (32a) and equation system (32b). Statement (ii) follows immediately from Corollary 4.
Regarding (iii), use (28) and Proposition 1 (iv) together with (32a) to obtain an expression for the class-wise feature-conditional selection probabilities which is equivalent to (34). □
As mentioned in Section 4.1 as a potential application of Theorem 2, assuming that the posterior probabilities under the sample distribution can be estimated, Theorem 3 offers two obvious ways to learn the characteristics of factorizable joint shift:
- (a) If the population prior class probabilities are known (for instance, from external sources), solve (32b) for the constants.
- (b) If the population prior class probabilities are unknown, fix values for the constants and solve (32b) for the prior probabilities. Letting all constants equal 1 is a natural choice that converts (32b) into the system of maximum likelihood equations for the population priors under the prior probability shift assumption (see the sketch following this list).
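For the special case in item (b) with all constants equal to 1, the resulting maximum likelihood equations are those studied by Saerens et al. [4] for prior probability shift, and they can be solved by a simple EM-type fixed-point iteration. The following sketch is illustrative only: the function name, the convergence tolerance and the assumption that sample posteriors evaluated at population feature points are available are all hypothetical.

```python
import numpy as np

def estimate_population_priors(sample_posteriors, sample_priors, n_iter=1000, tol=1e-10):
    """EM-type fixed-point iteration for class priors under prior probability shift
    (in the spirit of Saerens et al. [4]); corresponds to item (b) with constants 1.

    sample_posteriors : (n, k) array of posterior probabilities under the sample
                        distribution Q, evaluated at feature points drawn from the
                        population P (assumed to be available)
    sample_priors     : (k,) array of class priors under Q
    Returns an estimate of the (k,) class priors under the population distribution P.
    """
    priors = sample_priors.copy()
    for _ in range(n_iter):
        # Reweight the Q-posteriors towards the current prior estimate ...
        w = sample_posteriors * (priors / sample_priors)
        w /= w.sum(axis=1, keepdims=True)
        # ... and update the priors as the average of the reweighted posteriors.
        new_priors = w.mean(axis=0)
        if np.max(np.abs(new_priors - priors)) < tol:
            break
        priors = new_priors
    return priors
```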
In case (a), (34) may serve as an admissibility check for the solutions found. If the class-wise selection probabilities obtained from (34) can take values greater than 1, the corresponding set of values of the constants is not an admissible solution of (32b). If all solutions of (32b) turn out to be inadmissible, it must be concluded that the assumption of factorizable joint shift for the sample selection process is wrong.
In case (b), a constraint follows from (34) for all classes, which implies Inequality (35). This inequality provides a simple necessary criterion for the presence of factorizable joint shift with all constants equal to 1.
A further, less obvious special case of Theorem 3 is encountered if a certain additional relation among the constants is assumed to hold. Then, (34) implies that the class-wise feature-conditional selection probabilities do not depend on the class. By (31b), this means that population distribution and sample distribution are related by covariate shift, as already observed by Hein [23].