Factorizable Joint Shift in Multinomial Classification

Factorizable joint shift (FJS) was recently proposed as a type of dataset shift for which the complete characteristics can be estimated from feature data observations on the test dataset by a method called Joint Importance Aligning. For the multinomial (multiclass) classification setting, we derive a representation of factorizable joint shift in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features. On the basis of this result, we propose alternatives to joint importance aligning and, at the same time, point out that factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. Other results of the paper include correction formulae for the posterior class probabilities both under general dataset shift and factorizable joint shift. In addition, we investigate the consequences of assuming factorizable joint shift for the bias caused by sample selection.


Introduction
In machine learning terminology, dataset shift refers to the phenomenon that the joint distribution of features and labels on the training dataset used for learning a model may differ from the related joint distribution on the test dataset to which the model is going to be applied; see Storkey [1] or Moreno-Torres et al. [2] for surveys and background information on dataset shift. Dataset shift can be the consequence of very different causes. For that reason, a catch-all treatment of general dataset shift is difficult if not impossible. As a workaround a number of specific types of dataset shift have been defined in order to introduce additional assumptions that allow for different tailor-made approaches to deal with the problem. The most familiar subtypes of dataset shift are prior probability shift and covariate shift, but more types are introduced on a continuing basis as there is a practice-driven need to do so.
Typically, under dataset shift, the test dataset observations of features are available, but the class labels cannot be observed. In this situation, it is impossible to know ex ante if covariate shift or prior probability shift (or something in between) has occurred. However, estimates of models under assumptions of covariate shift and prior probability shift, respectively, tend to differ conspicuously. As a consequence, additional assumptions need to be made in order to be able to choose between modelling options related to covariate shift and prior probability shift. Such additional assumptions may be phrased in terms of causality (Storkey [1]): if the features can be considered "causing" the class labels, then models designed to deal with covariate shift are appropriate. Otherwise, if the class "causes" features, models targeting prior probability shift should be preferred.
He et al. [3] recently proposed "factorizable joint shift" (FJS), which generalises both prior probability shift and covariate shift. They went on to present the "joint importance aligning" method for estimating the characteristics of this type of shift. At first glance, He et al. hence seemed to provide a way to avoid choosing ex ante between covariate shift and prior probability shift models. Instead, "joint importance aligning" (plus some regularisation) appeared to be a method that functioned as a covariate shift model, prior probability shift model, or combined covariate and label shift model, as required by the characteristics of the test dataset.
arXiv:2207.14514v2 [stat.ML] 16 Sep 2022
By a detailed analysis of factorizable joint shift in multinomial classification settings, in this paper we point out that general factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. This is in contrast to the situations with covariate shift or prior probability shift. Therefore, circumspection is recommended with regard to potential deployment of "joint importance aligning" as proposed by He et al. [3].
He et al. characterised factorizable joint shift by claiming that "the biases coming from the data and the label are statistically independent". This description might not fully hit the mark. As we demonstrate in this paper, factorizable joint shift has little to do with statistical independence but should rather be interpreted as a structural property similar to the "separation of variables" which plays an important role for finding closed-form solutions to differential equations. We also argue that, in probabilistic terms, factorizable joint shift perhaps is better described as "scaled density ratios" shift.
The plan of this paper and its main research contributions are as follows:
• Section 2 "Setting the scene" presents the assumptions, concepts and notation for the multinomial (or multiclass) classification setting of this paper.
• Section 3 "General dataset shift in multinomial classification" introduces a normal form for the joint density of features and class labels (Theorem 1) and derives in Corollary 2 a generalisation of the correction formula for class posterior probabilities of Saerens et al. [4] and Elkan [5].
• Section 4 "Factorizable joint shift" defines this kind of dataset shift in a mathematically rigorous manner and presents a full representation in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features (Theorem 2). In addition, a specific version of the posterior correction formula is given (Corollary 4), and the description of factorizable joint shift as "scaled density ratios" shift is motivated. Moreover, alternatives to the "joint importance aligning" of He et al. [3] are proposed (Section 4.1).
• Section 5 "Common types of dataset shift" examines, in a mathematically rigorous manner, whether a number of types of dataset shift mentioned in the literature are implied by or imply factorizable joint shift. The types of dataset shift treated in this section are prior probability shift, covariate shift, covariate shift with posterior drift, domain invariance and generalised label shift. In addition, the posterior correction formulae specific to these types of dataset shift are presented.
• Section 6 "Sample selection bias" revisits the topic of dataset shift caused by sample selection bias and looks at the question of what the class-wise selection probabilities look like if the induced dataset shift is factorizable joint shift (Theorem 3).
• Section 7 "Conclusions" provides a short discussion of the important findings of the paper and points to some open research questions.

Setting the Scene
In this paper, we use the following population-level description of the multinomial classification problem under dataset shift in terms of measure theory. See standard textbooks on probability theory like Billingsley [6] or Klenke [7] for formal definitions and background of the notions introduced in Assumption 1. See Tasche [8] for a detailed reconciliation of the setting of this paper with the concepts and notation used in the mainstream machine learning literature.

Assumption 1.
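$(\Omega, \mathcal{F})$ is a measurable space. The source distribution $P$ and the target distribution $Q$ are probability measures on $(\Omega, \mathcal{F})$. For some positive integer $d \ge 2$, events $A_1, \ldots, A_d \in \mathcal{F}$ and a sub-σ-algebra $\mathcal{H} \subset \mathcal{F}$ are given. The events $A_i$, $i = 1, \ldots, d$, and $\mathcal{H}$ have the following properties:
(i) $\bigcup_{i=1}^d A_i = \Omega$.
(ii) $A_i \cap A_j = \emptyset$ for $i \neq j$.
(iii) $0 < P[A_i]$, $i = 1, \ldots, d$.
(iv) $0 < Q[A_i]$, $i = 1, \ldots, d$.
(v) $A_i \notin \mathcal{H}$, $i = 1, \ldots, d$.
In the literature, $P$ is also called "source domain" or "training distribution" while $Q$ is also referred to as "target domain" or "test distribution".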
The elements ω of Ω are objects (or instances) with class (label) and covariate (or feature) attributes. ω ∈ A i means that ω belongs to class i (or the positive class in the binary case if i = 1).
The σ-algebra F of events F ∈ F is a collection of subsets F of Ω with the property that they can be assigned probabilities P[F] and Q[F] in a logically consistent way. In the literature, thanks to their role of reflecting the available information, σ-algebras are sometimes also called "information set" (Holzmann and Eulert [9]). In the following, we use both terms exchangeably.
The sub-σ-algebra H ⊂ F generated by the covariates (features) contains the events which are observable at the time when the class of an object ω has to be predicted. Since A i ∉ H, i = 1, . . . , d, the class of the object may not yet be known at that time. In this paper, we assume that under the source distribution P, the class events A i can be observed such that the prior class probabilities can be estimated. In contrast, under the target distribution Q, the events A i cannot be directly observed and can only be predicted on the basis of the events H ∈ H, which are assumed to reflect the features of the object.
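For technical reasons, it is convenient to define the joint information set $\bar{\mathcal{H}}$ of features and class labels:

Definition 1. We denote by $\mathcal{A} = \sigma(\{A_1, \ldots, A_d\})$ the minimal sub-σ-algebra of $\mathcal{F}$ containing all $A_i$, $i = 1, \ldots, d$, and by $\bar{\mathcal{H}}$ the minimal sub-σ-algebra of $\mathcal{F}$ containing both $\mathcal{H}$ and $\mathcal{A}$, i.e., $\bar{\mathcal{H}} = \sigma(\mathcal{H} \cup \mathcal{A})$.

Note that the σ-algebra $\mathcal{A}$ can be represented as
$$\mathcal{A} = \Big\{ \bigcup_{i=1}^d (A_i \cap M_i) : M_1, \ldots, M_d \in \{\emptyset, \Omega\} \Big\}, \tag{1a}$$
while the σ-algebra $\bar{\mathcal{H}}$ can be written as
$$\bar{\mathcal{H}} = \Big\{ \bigcup_{i=1}^d (A_i \cap H_i) : H_1, \ldots, H_d \in \mathcal{H} \Big\}. \tag{1b}$$
A standard assumption in machine learning is that source and target distribution are the same, i.e., $P = Q$. The situation where $P[F] \neq Q[F]$ holds for at least one $F \in \bar{\mathcal{H}}$ is called dataset shift (Moreno-Torres et al. [2], Definition 1).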
Under dataset shift as defined this way, typically, classifiers or posterior class probabilities learnt under the source distribution stop working properly under the target distribution. Finding algorithms to deal with this problem is one of the tasks in the field of domain adaptation.
In this paper, we are mostly interested in exploring how posterior class probabilities change between a source and a target distribution as described in Assumption 1. In particular, we provide generalisations of the posterior correction formula (2.4) of Saerens et al. [4] (see also Theorem 2 of Elkan [5]). For this purpose, the notions of conditional expectation and conditional probability are crucial.
In the following, E P denotes conditional or unconditional expectation with respect to the probability measure P. For a given probability space (Ω, F , P), we refer to Section 8.2 of Klenke [7] for the formal definitions and properties of

• The expectation E P [X | H] of a real-valued random variable X conditional on a sub-σ-algebra H;
• The probability P[F | H] of an event F ∈ F conditional on H.
In the machine learning literature, often the term posterior class probability rather than conditional probability is used to refer to the conditional probabilities P[A i | H] and Q[A i | H], i = 1, . . . , d, in the context of Assumption 1. In contrast, the term prior probability is used for the probabilities P[A i ] and Q[A i ], which in our measure-theoretic setting should rather be called unconditional probabilities of A i .
An assumption of absolute continuity is also crucial for an investigation of how the posterior class probabilities are impacted by a change from the source distribution to the target distribution. Formally, this assumption reads as follows:

Assumption 2. Assumption 1 holds, and $Q$ is absolutely continuous with respect to $P$ on $\bar{\mathcal{H}}$, i.e.,

$$Q|_{\bar{\mathcal{H}}} \ll P|_{\bar{\mathcal{H}}},$$
where $M|_{\bar{\mathcal{H}}}$ stands for the measure $M$ with domain restricted to $\bar{\mathcal{H}}$.
The statement "$Q$ is absolutely continuous with respect to $P$ on $\bar{\mathcal{H}}$" means that for all events $N \in \bar{\mathcal{H}}$, $P[N] = 0$ implies $Q[N] = 0$. Hence, "impossible" events under $P$ are also impossible under $Q$. Measure-theoretic impossibility is somewhat unintuitive because for continuous distributions each single outcome has probability 0 and therefore is impossible. Nonetheless, sampled values from such distributions are single outcomes and occur despite having probability 0.
However, the statement "for all events $N \in \bar{\mathcal{H}}$, $P[N] = 0$ implies $Q[N] = 0$" is equivalent to saying: for all events $N \in \bar{\mathcal{H}}$, $Q[N] > 0$ implies $P[N] > 0$. This means that "possible" events under $Q$ are also possible events under $P$, even if with very tiny probabilities of occurrence. This phrasing of absolute continuity is more intuitive and is preferred by some authors, for instance by He et al. [3], who in Section 2 make the assumption $D_T(x, y) > 0 \Rightarrow D_S(x, y) > 0$, which they seem to understand in the sense of Assumption 2.
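For distributions on a finite set, the condition of absolute continuity can be checked directly. The following minimal sketch assumes discrete distributions given as probability vectors; the function name and the example vectors are hypothetical and only serve as an illustration.

```python
import numpy as np

def is_absolutely_continuous(q, p):
    """Return True if Q << P for two discrete distributions on the same finite set.

    q, p: probability vectors over the same outcomes (illustrative inputs).
    Q << P holds when every outcome with p == 0 also has q == 0.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return bool(np.all(q[p == 0.0] == 0.0))

p = np.array([0.5, 0.5, 0.0])
q_ok = np.array([0.999, 0.001, 0.0])   # very different from p, yet Q << P
q_bad = np.array([0.5, 0.4, 0.1])      # puts mass on an outcome that is impossible under p

print(is_absolutely_continuous(q_ok, p), is_absolutely_continuous(q_bad, p))
```

The first example illustrates that absolute continuity does not prevent the two distributions from differing strongly on events that are possible under both.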
As mentioned before, if the target distribution Q is absolutely continuous with respect to P, there may be events whose probabilities under Q are much greater than their probabilities under P. From a practical point of view, such events may even appear to be "impossible" under P. Notions such as "sufficient support" and "support sufficiency divergence" (Johansson et al. [10]) suggest that such is the view of the machine learning community. Hence, Assumption 2 is not necessarily in contrast to the working assumption of partially or fully nonoverlapping source and target domains made by many researchers in unsupervised domain adaptation.
For analyses of the case of domains where the source does not completely cover the target (such that Assumption 2 may be violated), see Johansson et al. [10]. However, the statement of Johansson et al., Section 5, "If this overlap is increased without losing information, such as through collection of additional samples, this is usually preferable." suggests that an assumption of nonoverlapping support is not the same as an assumption on a lack of absolute continuity. According to this statement, events outside of the source support do not appear to be impossible, because otherwise the "collection of additional samples" could not increase the support overlap between source and target. Assumption 2 is stronger than the common assumption of absolute continuity on H (see for instance, Scott [11]), but in terms of interpretation there is no big difference: all events possible under the target distribution (including in label space) are also possible under the source distribution.
An important consequence of Assumption 2 is that we can use the source distribution P as a reference measure for the target distribution Q. This is more natural than introducing another measure without real-world meaning as a reference for both P and Q. In addition, renouncing another measure as a reference has the advantageous effect of simplifying notation.
Recall the following common conventions intended to make the measure-theoretic notation more incisive: Notation 1. An important consequence of deploying a measure-theoretic framework as in this paper is that real-valued random variables X on a fixed probability space (Ω, F , P) are uniquely defined only up to events of probability 0 and may be undefined or ill-defined on such events or when being multiplied with the factor 0. To be more specific: The conventions listed in Notation 1 are convenient and used frequently in the following text. Note, however, that they are only valid in the context of a fixed probability measure P. For instance, under Assumption 2, if the event N where the random variable X has probability 0 under the source distribution P of being undefined, i.e., In the same vein, under Assumption 2, for the posterior class probabilities P[ with positive probability under P. In the following, we are careful to avoid such issues whenever the discussion involves more than one probability measure.

General Dataset Shift in Multinomial Classification
Under Assumption 2, by the Radon-Nikodym theorem, there is an $\bar{\mathcal{H}}$-measurable density $\bar{h} = \frac{dQ}{dP}\big|_{\bar{\mathcal{H}}}$ of the target distribution $Q$ with respect to the source distribution $P$ on the joint information set $\bar{\mathcal{H}}$ defined by (1b). This density links $Q$ to $P$ by Equation (2):
$$Q[F] = E_P\big[\bar{h}\, 1_F\big], \quad F \in \bar{\mathcal{H}}. \tag{2}$$
In (2) and in the remainder of the paper, $1_F$ denotes the indicator function of $F$, defined by $1_F(\omega) = 1$ for $\omega \in F$ and $1_F(\omega) = 0$ for $\omega \notin F$.
Unfortunately, in practice $\bar{h}$ is more or less unobservable. Therefore, it is desirable to decompose it into smaller parts which may be observable or can perhaps be determined through reasonable assumptions. The key step to such a decomposition is made with the following combination of definitions and lemma.

Definition 2.
Under Assumption 1, define the following class-conditional distributions, by letting for $F \in \mathcal{F}$ and $i = 1, \ldots, d$
$$P_i[F] = P[F \mid A_i] = \frac{P[A_i \cap F]}{P[A_i]}, \qquad Q_i[F] = Q[F \mid A_i] = \frac{Q[A_i \cap F]}{Q[A_i]}. \tag{3}$$
In the literature, when restricted to the feature information set $\mathcal{H}$, the $P_i$ and $Q_i$ sometimes are called class-conditional feature distributions.

Lemma 1. Under Assumption 2, for $i = 1, \ldots, d$, the class-conditional feature distribution $Q_i$ is absolutely continuous with respect to $P_i$ on $\mathcal{H}$.

Proof. Fix i and choose any
Hence, we have $Q_i|_{\bar{\mathcal{H}}} \ll P_i|_{\bar{\mathcal{H}}}$, from which $Q_i|_{\mathcal{H}} \ll P_i|_{\mathcal{H}}$ follows. The uniqueness of Radon-Nikodym derivatives implies and hence the right-hand side of (4). However, by the definition of conditional probability it also follows that . This implies the left-hand side of (4).
With Lemma 1 as preparation, we are in a position to state the following key representation result and some corollaries for the joint density $\bar{h}$ of features and class labels. In the remainder of this paper, we make use of (5) as a normal form for $\bar{h}$.

Theorem 1. Under Assumption 2, the density $\bar{h}$ of $Q$ with respect to $P$ on $\bar{\mathcal{H}}$ can be represented as
$$\bar{h} = \sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, h_i\, 1_{A_i}, \tag{5}$$
where the $h_i$ are any densities of $Q_i$ with respect to $P_i$ on $\mathcal{H}$ as introduced in Lemma 1, for $i = 1, \ldots, d$.
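Proof. Let $F \in \bar{\mathcal{H}}$. By (1b), it then holds that $F = \bigcup_{i=1}^d (A_i \cap H_i)$ for some $H_1, \ldots, H_d \in \mathcal{H}$. This implies
$$Q[F] = \sum_{i=1}^d Q[A_i]\, Q_i[H_i] = \sum_{i=1}^d Q[A_i]\, E_{P_i}\big[h_i\, 1_{H_i}\big] = \sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, E_P\big[h_i\, 1_{A_i \cap H_i}\big] = E_P\Big[1_F \sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, h_i\, 1_{A_i}\Big].$$
Equation (5) follows from this by the definition of Radon-Nikodym derivatives.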
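On a finite space, the normal form of Theorem 1, i.e., the joint density written as the sum over classes of (Q[A_i]/P[A_i]) h_i 1_{A_i}, can be verified numerically. The sketch below assumes a toy joint distribution with three feature values and two classes; all numbers and names are hypothetical.

```python
import numpy as np

# Joint probabilities on a finite space: rows = feature values x, columns = classes i.
# The numbers are illustrative only.
P = np.array([[0.20, 0.10],
              [0.15, 0.25],
              [0.05, 0.25]])
Q = np.array([[0.10, 0.30],
              [0.20, 0.10],
              [0.10, 0.20]])

def normal_form_density(P, Q):
    """Rebuild dQ/dP on the joint space from class priors and class-conditional densities h_i."""
    prior_P = P.sum(axis=0)            # P[A_i]
    prior_Q = Q.sum(axis=0)            # Q[A_i]
    cond_P = P / prior_P               # P_i({x}) = P[x | A_i]
    cond_Q = Q / prior_Q               # Q_i({x}) = Q[x | A_i]
    h_i = cond_Q / cond_P              # density of Q_i with respect to P_i on the features
    return (prior_Q / prior_P) * h_i   # value of the joint density on {x} ∩ A_i

h_bar_direct = Q / P                       # joint density computed directly
h_bar_normal = normal_form_density(P, Q)   # joint density via the normal form

print(np.allclose(h_bar_direct, h_bar_normal))
```

Both computations give the same joint density, and its expectation under P equals 1, as required of a density.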

Corollary 1.
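Under Assumption 2, the density $h$ of $Q$ with respect to $P$ on $\mathcal{H}$ can be written as
$$h = \sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, h_i\, P[A_i \mid \mathcal{H}].$$

Corollary 2. Under Assumption 2, the target posterior class probabilities $Q[A_j \mid \mathcal{H}]$, $j = 1, \ldots, d$, can be represented as
$$Q[A_j \mid \mathcal{H}] = \frac{\frac{Q[A_j]}{P[A_j]}\, h_j\, P[A_j \mid \mathcal{H}]}{\sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, h_i\, P[A_i \mid \mathcal{H}]} \tag{6}$$
on the set $\{h > 0\}$, where $h$ denotes the denominator of the right-hand side of (6) (and the density of $Q$ with respect to $P$ on $\mathcal{H}$, as introduced in Corollary 1).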
Equation (6) generalises the posterior correction formula of Saerens et al. [4] and Elkan [5]. A direct application of the posterior correction formula (6) is not possible because the target prior probabilities Q[A i ] and the target class-conditional feature densities h i typically are unknown. However, in some cases the target priors might be known from external sources such as central banks, the IMF or national offices of statistics. Under more specific assumptions on the type of dataset shift, it may be possible to estimate the target priors from the target dataset. See González et al. [12] for a survey of estimation methods under the assumption of prior probability shift.
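When the target priors and the class-conditional density ratios h i are available, the posterior correction can be coded in a few lines. The following sketch assumes that the correction takes the form Q[A_j | H] proportional to (Q[A_j]/P[A_j]) h_j P[A_j | H], normalised over the classes; the function name, argument layout and toy numbers are hypothetical.

```python
import numpy as np

def correct_posteriors(source_post, source_priors, target_priors, density_ratios=None):
    """Posterior correction: Q[A_j|H] proportional to (Q[A_j]/P[A_j]) * h_j * P[A_j|H].

    source_post:    (n, d) array of source posteriors P[A_i | H], one row per instance
    density_ratios: (n, d) array of h_i values; None is read as h_i = 1 for all i
    """
    source_post = np.asarray(source_post, dtype=float)
    w = np.asarray(target_priors, dtype=float) / np.asarray(source_priors, dtype=float)
    if density_ratios is None:
        density_ratios = np.ones_like(source_post)
    unnorm = w * np.asarray(density_ratios, dtype=float) * source_post
    h = unnorm.sum(axis=1, keepdims=True)   # marginal density of Q w.r.t. P on the features
    if np.any(h <= 0.0):
        raise ValueError("correction is only defined where the marginal density is positive")
    return unnorm / h

# Toy joint distributions (rows = feature values, columns = classes), illustrative only.
P = np.array([[0.20, 0.10], [0.15, 0.25], [0.05, 0.25]])
Q = np.array([[0.10, 0.30], [0.20, 0.10], [0.10, 0.20]])
prior_P, prior_Q = P.sum(axis=0), Q.sum(axis=0)
h_i = (Q / prior_Q) / (P / prior_P)
source_post = P / P.sum(axis=1, keepdims=True)
target_post = correct_posteriors(source_post, prior_P, prior_Q, h_i)

print(target_post)
```

In this toy example the corrected posteriors coincide with the posteriors computed directly from the target joint distribution, which serves as a sanity check of the formula.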
Under prior probability shift, h i = 1 is assumed for all i (see Section 5.1 below). This means there is no change of the class-conditional feature distributions. This assumption might be too strong in some situations. It might be more promising to assume similar changes for all classes (i.e., h i ≈ h j for i ≠ j), for instance, by assuming factorizable joint shift (see Section 4 below), or by trying to find transformations (or representations) of the features that make the resulting feature densities similar (see Sections 5.4 and 5.5 below).
For the sake of completeness, we also mention the following alternative representation (7b) of $\bar{h} = \frac{dQ}{dP}\big|_{\bar{\mathcal{H}}}$. Compared to (7b), (5) provides more structural information, in particular when taking into account Corollary 2 above, and is therefore potentially more useful.
Moreover, the density $\bar{h}$ of $Q$ with respect to $P$ on $\bar{\mathcal{H}}$ can be represented as
$$\bar{h} = h \sum_{i=1}^d \frac{Q[A_i \mid \mathcal{H}]}{P[A_i \mid \mathcal{H}]}\, 1_{A_i}. \tag{7b}$$
Proof. Equation (7a) follows immediately from Corollary 2. Taking into account Notation 1 for the meaning of (7b) on the event $\{P[A_i \mid \mathcal{H}] = 0\}$, the equation follows from (1b) and the definition of the posterior class probabilities.
The following result may be considered an inversion of the previous results and in particular Corollary 2 on the relationship between source and target distributions. It is of interest mostly for dealing with sample selection bias (see Section 6 below). Proposition 1. In the setting of Theorem 1, assume additionally that P[h = 0] = 0 holds. Then, the following statements hold true: (i) P is absolutely continuous with respect to Q on H, with dP (iii) The density dP dQ H can also be represented as (iv) The density dP dQ H can be represented as Proof. (i) is a well-known property of equivalent probability measures (see Problem 32.6 of Billingsley [6]). By (i), P is absolutely continuous with respect to Q on H. This implies that P i is absolutely continuous with respect to Q i on H and, again by Problem 32.6 of [6], the rest of (ii) follows as well.
Properties (iii), (iv) and (v) follow from (i) and (ii), by making use of Theorem 1 and Corollaries 1 and 2 with swapped roles of P and Q.

Factorizable Joint Shift
The following definition translates Definition 2.2 of He et al. [3] into the setting of this paper.

Definition 3.
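Under Assumption 2, we say that the target distribution $Q$ is related to the source distribution $P$ by factorizable joint shift (FJS) if there are a non-negative $\mathcal{H}$-measurable function $g$ and a non-negative $\mathcal{A}$-measurable function $b$ such that the density $\bar{h}$ of $Q$ with respect to $P$ on $\bar{\mathcal{H}}$ can be represented as
$$\bar{h} = g\, b. \tag{8a}$$
Observe that the functions $g$ and $b$ of Definition 3 are not uniquely determined because for any $c > 0$ the functions $g_c = c\, g$ and $b_c = b/c$ are also $\mathcal{H}$-measurable and $\mathcal{A}$-measurable, respectively, and satisfy
$$\bar{h} = g_c\, b_c. \tag{8b}$$
In the remainder of this section, we show that the functions $g$ and $b$ depend on the source distribution $P$ as well as the marginal distributions of $Q$ on $\mathcal{H}$ and $\mathcal{A}$, respectively, but not on the joint distribution $Q|_{\bar{\mathcal{H}}}$. For the case $d = 2$, a stronger result is obtained in Section 4.2 below.

Theorem 2. Under Assumption 2, suppose that $Q$ and $P$ are related by factorizable joint shift as in (8a). Denote by $h$ the density of $Q$ with respect to $P$ on $\mathcal{H}$ and let $q_i = Q[A_i]$, $i = 1, \ldots, d$. Then, up to a constant factor $c$ as in (8b), it follows that
$$b = \sum_{i=1}^{d-1} \varrho_i\, 1_{A_i} + 1_{A_d}, \tag{9a}$$
$$g = \frac{h}{\sum_{j=1}^{d-1} \varrho_j\, P[A_j \mid \mathcal{H}] + P[A_d \mid \mathcal{H}]}, \tag{9b}$$
where the constants $\varrho_1, \ldots, \varrho_{d-1}$ are positive and finite and satisfy the following equation system:
$$q_i = \varrho_i\, E_P\left[\frac{h\, P[A_i \mid \mathcal{H}]}{\sum_{j=1}^{d-1} \varrho_j\, P[A_j \mid \mathcal{H}] + P[A_d \mid \mathcal{H}]}\right], \quad i = 1, \ldots, d-1. \tag{9c}$$
Conversely, if a density $h$ with respect to $P$ on $\mathcal{H}$ and a probability distribution $(q_i)_{i=1,\ldots,d}$ with $q_i > 0$ are given, if $\varrho_1, \ldots, \varrho_{d-1} > 0$ are solutions of the equation system (9c), and if $b$ and $g$ are defined by (9a) and (9b), respectively, then $g\, b$ is a density of a probability measure $Q$ with respect to $P$ on $\bar{\mathcal{H}}$, such that $h$ is the marginal density of $Q$ with respect to $P$ on $\mathcal{H}$ and $Q[A_i] = q_i$ holds for $i = 1, \ldots, d$.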
Proof. First, we show that (9a)-(9c) are necessary if Q and P are related by factorizable joint shift as in (8a).
The converse statement follows from the following observations: • With b and g as in (9a) and (9b), E P [g b] = 1 holds such that g b is an H-measurable density with respect to P.
Thanks to Theorem 2, the following version of the posterior correction formula (6) can be given for factorizable joint shift.
Proof. Apply the generalised Bayes formula (Lemma A1 in Appendix A) for G = H, X = 1 A j and f = g b, with g and b specified by (9b) and (9a), respectively.
Recall that P[A k | H]/P[A k ] is the density with respect to P of the class-conditional feature distribution P k , as defined by (3), on the feature information set H. Similarly, Q[A k | H]/Q[A k ] is the density with respect to Q of the class-conditional feature distribution Q k on H. Therefore, (13) states that under factorizable joint shift, the ratios of the class-conditional feature densities are invariant up to a constant factor.
Remark 1 suggests that factorizable joint shift could also be called scaled density ratios shift. This term would emphasise a probabilistic interpretation of this kind of dataset shift, in contrast to "factorizable joint shift" with its focus on the technical aspect of separation of input and output variables.

Alternatives to Joint Importance Aligning
He et al. [3] proposed in Section 3 the "joint importance aligning" method for estimating a factorized version of the ratio of source and target domain densities which they called "joint importance weight". He et al. presented a "supervised" and an "unsupervised" version of their method. The "unsupervised" version was intended for the case where no class labels were observed in the target domain, i.e., the case considered primarily in this paper.
Regarding the performance of the "unsupervised" version of their proposal, He et al. indicated that the proposed method tended to present simple covariate shift (see Section 5.2 below) as a solution. This does not come as a surprise because He et al. [3] stated ". . . in unsupervised objective, we define $\tilde{V}(x) \triangleq E_{y \sim D_S(y|x)} V(y)$ . . . ", which suggests that the authors implicitly assumed $D_S(y|x) = D_T(y|x)$, i.e., covariate shift. Without providing an explanation, He et al. proposed a discretisation of the data (covariate) space in order to prevent the algorithm from converging to covariate shift as solution.
Given these qualms about "joint importance aligning", it might be useful to point out alternative approaches to finding the factorization (8a), based on Theorem 2. The theorem suggests two obvious ways to learn the characteristics of factorizable joint shift:
See Section 4.2.4 of Tasche [13] for an example of approach (a) from the area of credit risk. Regarding the interpretation of (9c) in approach (b) as maximum likelihood equations, see Du Plessis and Sugiyama [14] or Tasche [15]. This interpretation, in particular, implies that an EM (expectation maximisation) algorithm can be deployed for solving the equation system (Saerens et al. [4]).

The Binary Case
Theorem 2 does not provide sufficient or necessary conditions for the existence or uniqueness of solutions to equation system (9c) if a density h and a candidate class distribution (q i ) i=1,...,d are given. In the special case d = 2, such an existence and uniqueness statement can be made, as the following proposition shows. It is a generalisation of Section 4.2.4 of Tasche [13].

Proposition 2. Let (Ω, F , P) be a probability space, H ⊂ F a sub-σ-algebra of F and A ∈ F \ H with 0 < p = P[A] < 1. Assume that P[P[A | H] ∈ {0, 1}] = 0.
Assume additionally that H and A are not independent under P. Then, the solution to (9c) is unique. Denote by φ : (0, 1) → (0, ∞) the function that maps, for a fixed density h, the number 0 < q < 1 to , i.e., φ(q) = . Then, φ has the following properties: (i) φ is strictly increasing and continuous on (0, 1).
See Appendix B for a proof of Proposition 2. The uniqueness statement of Proposition 2 is interesting because it implies an answer to the question of whether proper concept shift (dataset shift where the marginal distributions of the features and labels, respectively, remain unchanged) can be modelled as factorizable joint shift. The answer-at least for the binary case-is no, because "no shift" then provides the only solution to Equation (9c).
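In the binary case, the monotonicity underlying Proposition 2 suggests solving for the single unknown constant by bisection. The sketch below assumes that for d = 2 the equation system (9c) reduces to q = ϱ E_P[h s / (ϱ s + 1 − s)] with s = P[A | H], and it works on a finite feature space; the function name, the search bracket and the toy numbers are hypothetical.

```python
import numpy as np

def solve_rho_binary(p_x, h_x, s_x, q, lo=1e-12, hi=1e12, iters=200):
    """Solve q = rho * E_P[h * s / (rho * s + 1 - s)] for rho by bisection on log(rho).

    p_x: source probabilities of the feature values
    h_x: target-to-source density of the features (must satisfy sum(p_x * h_x) == 1)
    s_x: source posterior probabilities P[A | x], strictly between 0 and 1
    q:   target prior probability of class A, strictly between 0 and 1
    """
    p_x, h_x, s_x = (np.asarray(a, dtype=float) for a in (p_x, h_x, s_x))

    def implied_prior(rho):
        # Right-hand side of the equation; strictly increasing in rho.
        return float(np.sum(p_x * h_x * rho * s_x / (rho * s_x + 1.0 - s_x)))

    log_lo, log_hi = np.log(lo), np.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if implied_prior(np.exp(mid)) < q:
            log_lo = mid
        else:
            log_hi = mid
    return float(np.exp(0.5 * (log_lo + log_hi)))

# Toy example: generate q from a known rho and recover it.
p_x = np.array([0.3, 0.4, 0.3])
h_x = np.array([1.5, 1.0, 0.5])   # sum(p_x * h_x) == 1
s_x = np.array([0.2, 0.5, 0.8])
rho_true = 2.5
q = float(np.sum(p_x * h_x * rho_true * s_x / (rho_true * s_x + 1.0 - s_x)))
rho_hat = solve_rho_binary(p_x, h_x, s_x, q)

print(rho_hat)
```

Bisection is used here only because the implied prior is monotone in the unknown; it is one possible numerical approach and not a method prescribed by the paper.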

Common Types of Dataset Shift
In this section, we revisit some popular special cases of dataset shift. In each case, we discuss the question if factorizable joint shift is implied or if the special type of shift is implied by factorizable joint shift. In addition, we provide in each case an adapted version of the posterior correction formula (6).

Prior Probability Shift
Moreno-Torres et al. [2] defined prior probability shift as invariance of the class-conditional feature distributions between source and target, i.e.,
$$Q_i[H] = P_i[H], \quad H \in \mathcal{H},\ i = 1, \ldots, d, \tag{14a}$$
with $Q_i$ and $P_i$ defined as in (3) above, and $Q[A_i] \neq P[A_i]$ for at least one $i$. This type of dataset shift is also known as "target shift" [16], "global drift" [17], "label shift" [18] and under other names. In terms of the notation used in Theorem 1, (14a) is equivalent to having the densities of the $Q_i$ with respect to the $P_i$ on the feature information set $\mathcal{H}$ equal to 1, i.e.,
$$h_i = 1, \quad i = 1, \ldots, d. \tag{14b}$$
By Theorem 1, (14b) implies for the density $\bar{h}$ of $Q$ with respect to $P$ on $\bar{\mathcal{H}}$ that
$$\bar{h} = \sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, 1_{A_i}, \tag{15}$$
which obviously is an $\mathcal{A}$-measurable function. Definition 3 of factorizable joint shift, therefore, is satisfied, as stated by He et al. [3] in Table 1.
The posterior correction formula (6) in this case takes the well-known form
$$Q[A_j \mid \mathcal{H}] = \frac{\frac{Q[A_j]}{P[A_j]}\, P[A_j \mid \mathcal{H}]}{\sum_{i=1}^d \frac{Q[A_i]}{P[A_i]}\, P[A_i \mid \mathcal{H}]}, \quad j = 1, \ldots, d, \tag{16}$$
as noted before, e.g., by Saerens et al. [4] and Elkan [5].
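If the target priors are unknown, they can be estimated under prior probability shift from target feature observations with the EM iteration of Saerens et al. [4]. The sketch below is a minimal version of that iteration, assuming that source posteriors have been evaluated on the target instances; the function name, the optional instance weights and the toy example are hypothetical additions for illustration.

```python
import numpy as np

def em_target_priors(source_post, source_priors, weights=None, iters=1000, tol=1e-12):
    """EM iteration for the target priors under prior probability shift.

    source_post:   (n, d) source posteriors P[A_i | x] evaluated on target instances
    source_priors: (d,) source priors P[A_i]
    weights:       optional (n,) instance weights summing to 1 (default: uniform)
    """
    source_post = np.asarray(source_post, dtype=float)
    source_priors = np.asarray(source_priors, dtype=float)
    n = source_post.shape[0]
    weights = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    priors = source_priors.copy()
    for _ in range(iters):
        # E-step: corrected posteriors under the current prior estimate.
        unnorm = (priors / source_priors) * source_post
        post = unnorm / unnorm.sum(axis=1, keepdims=True)
        # M-step: new priors are the weighted averages of the corrected posteriors.
        new_priors = weights @ post
        converged = np.max(np.abs(new_priors - priors)) < tol
        priors = new_priors
        if converged:
            break
    return priors

# Toy population-level example: class-conditional feature distributions stay fixed.
C = np.array([[0.7, 0.2, 0.1],      # P[x | class 1]
              [0.1, 0.3, 0.6]])     # P[x | class 2]
p = np.array([0.5, 0.5])            # source priors
q_true = np.array([0.2, 0.8])       # target priors to be recovered
joint_P = (p[:, None] * C).T        # rows = feature values, columns = classes
source_post = joint_P / joint_P.sum(axis=1, keepdims=True)
target_marginal = q_true @ C        # target feature distribution, used as weights
q_hat = em_target_priors(source_post, p, weights=target_marginal)

print(q_hat)
```

Passing the exact target feature distribution as weights makes the example deterministic; with real data, the rows of `source_post` would be the scored target instances and the weights would be uniform.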

Covariate Shift
Moreno-Torres et al. [2] defined covariate shift as invariance of the posterior class probabilities between source and target, i.e.,
$$Q[A_i \mid \mathcal{H}] = P[A_i \mid \mathcal{H}], \quad i = 1, \ldots, d. \tag{17}$$
Proof. The "if" part of the assertion is Lemma 1 of Tasche [8]. Taking into account Notation 1, the "only if" is implied by Corollary 3.
Proposition 3 implies that covariate shift is a special case of factorizable joint shift in the sense of Definition 3, with b = 1 and g = h, as noted in Table 1 of He et al. [3].
Then, observe that the fact that $b$ is constant implies by (9a) that
$$\varrho_1 = \ldots = \varrho_{d-1} = 1. \tag{18}$$
It can readily be checked that under the assumption of covariate shift the $\varrho_i$ defined by (18) indeed solve equation system (9c).
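The check mentioned above can be written out in one line. The following worked equation assumes that the right-hand side of (9c) has the form shown in the first expression, with h the density of Q with respect to P on the feature information set and q_i = Q[A_i]; it is a sketch under that assumption.

```latex
\begin{align*}
\varrho_i\, E_P\!\left[\frac{h\, P[A_i \mid \mathcal{H}]}
  {\sum_{j=1}^{d-1} \varrho_j\, P[A_j \mid \mathcal{H}] + P[A_d \mid \mathcal{H}]}\right]
  \Bigg|_{\varrho_1 = \ldots = \varrho_{d-1} = 1}
&= E_P\big[h\, P[A_i \mid \mathcal{H}]\big]
   && \text{since } \textstyle\sum_{j=1}^{d} P[A_j \mid \mathcal{H}] = 1 \\
&= E_Q\big[P[A_i \mid \mathcal{H}]\big]
   && \text{since } h \text{ is the density of } Q \text{ on } \mathcal{H} \\
&= E_Q\big[Q[A_i \mid \mathcal{H}]\big] = Q[A_i] = q_i
   && \text{by covariate shift.}
\end{align*}
```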

Covariate Shift with Posterior Drift
Scott [11] defined covariate shift with posterior drift (CSPD) for the binary special case ($d = 2$) of Assumption 1 as the following variant of (17): there exists a strictly increasing function $\phi$ such that
$$Q[A_1 \mid \mathcal{H}] = \phi\big(P[A_1 \mid \mathcal{H}]\big). \tag{19}$$
Equation (19) implies that $Q[A_1 \mid \mathcal{H}]$ and $P[A_1 \mid \mathcal{H}]$ are strongly comonotonic. As shown in Tasche [19], the converse implication also holds true.
Note that from (19), it also follows that
$$Q[A_2 \mid \mathcal{H}] = 1 - \phi\big(1 - P[A_2 \mid \mathcal{H}]\big).$$
Hence, the increasing link between the posterior positive class probabilities defining CSPD does not only apply to class $A_1$ but automatically also to the negative class $A_2$.
CSPD is implied by factorizable joint shift. This follows from (12) because of which is strictly increasing in x.
Under CSPD, the class-conditional densities h i = dQ i dP i H, i = 1, 2, introduced in Lemma 1 can be shown to be where h is the density of Q with respect to P on the feature information set H. Alas, when used in connection with Theorem 1, (20) does not provide a very useful representation of h.

Domain Invariance
Translated into the concepts and notation of this paper, domain invariance (see Table 1 of He et al. [3]) is defined as follows:

•
There is an H-measurable mapping (transformation) T into some measurable space with the property that where G = σ(T) denotes the smallest sub-σ-algebra of H such that T is still G-measurable. • For all i = 1, . . . d it holds that: Property (21b) means that T is sufficient for H under both P and Q in the sense of Section 32.3 of Devroye et al. [20].
As mentioned in He et al. [3], (21a) implies covariate shift with respect to G, i.e., From (21b) then follows covariate shift with respect to H. Actually, this reasoning shows that in the definition of domain invariance according to He et al. [3], (21a) could be replaced by the weaker assumption (21c), without losing the consequence that covariate shift holds on the whole information set H.

Generalised Label Shift
Tachet des Combes et al. [21] defined generalised label shift (GLS) as follows: there is an $\mathcal{H}$-measurable mapping (transformation) $T$ into some measurable space with the property that
$$Q_i[G] = P_i[G], \quad G \in \sigma(T),\ i = 1, \ldots, d. \tag{22}$$
Since $\sigma(T) \subset \mathcal{H}$ holds, this is weaker than requiring (14a) as for prior probability shift. In this sense, GLS generalises prior probability shift.
He et al. [3] gave in Table 1 a narrower definition of GLS, by requiring in addition to (22) also (21b), and went on to prove that GLS implied factorizable joint shift. We provide an alternative proof of this result, providing mathematically rigorous meaning for the factorisation proposed by He et al.

Proposition 4.
Under Assumption 2, let there be an H-measurable mapping T into some measurable space such that (22) and (21b) hold. Denote by h the density of the target distribution Q with respect to the source distribution P on H. Then, Q and P are related by factorizable joint shift in the sense of See Appendix B for a proof of Proposition 4. Observe that Proposition 4 and Corollary 4 together imply that the same class posterior correction formula (16) applies for generalised label shift and prior probability shift.
The factorisation presented in (23) of Proposition 4 corresponds to the factorisation of generalised label shift proposed by He et al. [3] in Table 1 in the following way:  (23) as the density ratio g = h/γ, hence with a well-defined mathematical meaning.
Remark 2. Proposition 4 combined with Remark 1 shows that "generalized label shift" in the sense of He et al. [3] is the same type of dataset shift that was discussed as "invariant density ratio"-type dataset shift in Tasche [22].

Sample Selection Bias
Sample selection bias is an important cause of dataset shift. In this section, we revisit parts of Hein [23] in order to illustrate some of the concepts and results presented before. We basically work under Assumption 1 but without the interpretation of P as source and Q as target distribution. Instead, P is interpreted as the distribution of a population from which a potentially biased random sample is taken, resulting in the distribution Q. When studying sample selection bias in this setting, the goal is to infer properties of P from properties of the sample distribution Q.
The following assumption describes the setting of this section. The idea is that under the population distribution, each object has a positive chance to be selected. This chance may depend upon the features (covariates) and the class of the object.

Assumption 3 (Sample selection).
$(\Omega, \mathcal{F})$ is a measurable space. The population distribution $P$ is a probability measure on $(\Omega, \mathcal{F})$. For some positive integer $d \ge 2$, events $A_1, \dots, A_d \in \mathcal{F}$ and a sub-σ-algebra $\mathcal{H} \subset \mathcal{F}$ are given. The events $A_i$, $i = 1, \dots, d$, and $\mathcal{H}$ have the following properties: The selection probability is an $\mathcal{H}$-measurable random variable $0 < \phi \le 1$ where the sub-σ-algebra $\mathcal{H}$ is defined as in (1b).
The probability space $(\Omega, \mathcal{F}, P)$ also supports a random variable $U$ which is uniformly distributed on $[0, 1]$ such that $U$ and $\mathcal{H}$ are independent.

Definition 4 (Sample distribution).
Under Assumption 3, define the event of being selected by $S = \{U \le \phi\}$. The probability measure $Q$ on $(\Omega, \mathcal{F})$, defined by
$$Q[F] = P[F \mid S] = \frac{P[F \cap S]}{P[S]}, \quad F \in \mathcal{F},$$
is called the sample distribution.
Note that the measure $Q$ is well-defined because from the independence of $U$ and $\mathcal{H}$, it follows that $P[S] = E_P[\phi] > 0$. Another consequence of the independence of $U$ and $\mathcal{H}$ is $P[S \mid \mathcal{H}] = \phi$.

Proposition 5.
$P$ and $Q$ as described in Assumption 3 and Definition 4 satisfy Assumptions 1 and 2 with $P$ as source distribution and $Q$ as target distribution. Moreover, $P$ is absolutely continuous with respect to $Q$ on $\mathcal{H}$.
Proof. By definition of $Q$ as $P$ conditional on $S$, the sample distribution $Q$ is absolutely continuous with respect to $P$ on $\mathcal{F}$ and hence also on $\mathcal{H} \subset \mathcal{F}$. For the density $h$, we obtain
$$h = \frac{\phi}{P[S]} > 0.$$
The fact that $h$ is positive implies that $P$ is absolutely continuous with respect to $Q$ on $\mathcal{H}$. Since $A_i \in \mathcal{H}$ for $i = 1, \dots, d$, the absolute continuity of $P$ with respect to $Q$ implies $Q[A_i] > 0$, $i = 1, \dots, d$.
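The selection mechanism can be checked on a finite example. The following minimal Python sketch assumes a space whose atoms are (feature value, class) pairs; the atom probabilities and the selection probabilities are made-up illustration values, not taken from the paper. It computes $P[S]$ as the $P$-expectation of $\phi$, forms $Q$ as $P$ conditional on $S$, and confirms that the density of $Q$ with respect to $P$ equals $\phi / P[S]$ and is positive.

```python
from fractions import Fraction as F

# Hypothetical population distribution P on atoms (feature value, class label).
P = {("x0", 1): F(3, 10), ("x0", 2): F(1, 10),
     ("x1", 1): F(2, 10), ("x1", 2): F(4, 10)}

# Hypothetical selection probabilities 0 < phi <= 1, one value per atom.
phi = {("x0", 1): F(1, 2), ("x0", 2): F(1, 1),
       ("x1", 1): F(1, 4), ("x1", 2): F(1, 5)}

# P[S] = E_P[phi], because U is uniform on [0, 1] and independent of the atoms.
p_selected = sum(P[a] * phi[a] for a in P)

# Sample distribution Q = P[. | S], evaluated on the atoms.
Q = {a: P[a] * phi[a] / p_selected for a in P}

# Density of Q with respect to P on the atoms.
h = {a: Q[a] / P[a] for a in P}

for atom in sorted(P):
    print(atom, "Q =", Q[atom], "h =", h[atom])
```

Exact fractions are used so that the identities hold without rounding error; with real data the same quantities would have to be estimated.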

Properties of the Sample Selection Model
Equation (25) implies that the density $h$ of $Q$ with respect to $P$ on $\mathcal{H}$ is given by $h = P[S \mid \mathcal{H}] / P[S]$. From representation (1b) of $\mathcal{H}$, an alternative description of $P[S \mid \mathcal{H}]$ follows in terms of the probabilities $P_i[S \mid \mathcal{H}]$, where the $P_i$ denote the class-conditional feature distributions under $P$, see Definition 2. $P_i[S \mid \mathcal{H}]$ is accordingly the feature-conditional probability of being selected on the subpopulation of objects with class $A_i$. For $i = 1, \dots, d$ and $H \in \mathcal{H}$, a short calculation leads to Equation (28). Equation (28) and Theorem 1 together imply an alternative representation of $h$.

By the generalised Bayes formula (Lemma A1 in Appendix A), (25) implies the representation (30) of the posterior class probabilities $Q[A_i \mid \mathcal{H}]$, $i = 1, \dots, d$, under $Q$.

Zadrozny [24] and Hein [23] observed that if the event $S$ of being selected and the class labels, as expressed by the σ-algebra $\mathcal{A}$, are independent conditional on $\mathcal{H}$, the information set reflecting the features, then the population distribution $P$ and the sample distribution $Q$ are related by covariate shift. A consequence of (30) is that the converse of this observation also holds true, as stated in the following proposition.

Proposition 6.
Under Assumption 3 and Definition 4, $P$ and $Q$ are related by covariate shift if and only if $S$ and $\mathcal{A}$ are independent conditional on $\mathcal{H}$ under $P$.

Proof. Proposition 6 is obvious from (30) and the definition of covariate shift (17).
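The link between class-independent selection and covariate shift can be illustrated numerically. The sketch below is a minimal Python example under stated assumptions: a finite space of (feature value, class) atoms and two hypothetical selection rules, one depending on the feature only and one depending also on the class. The helper names `select` and `posteriors` are introduced here for illustration only. Covariate shift is taken to mean that the posterior class probabilities given the features are the same under $P$ and $Q$.

```python
from fractions import Fraction as F

def select(joint, phi):
    """Sample distribution Q = P[. | S] when atom a is selected with probability phi[a]."""
    p_selected = sum(joint[a] * phi[a] for a in joint)
    return {a: joint[a] * phi[a] / p_selected for a in joint}

def posteriors(joint):
    """Posterior class probabilities {(x, c): Prob[c | x]} of a joint distribution."""
    marginal = {}
    for (x, _), p in joint.items():
        marginal[x] = marginal.get(x, 0) + p
    return {(x, c): p / marginal[x] for (x, c), p in joint.items()}

# Hypothetical population distribution on atoms (feature value, class label).
P = {("x0", 1): F(3, 10), ("x0", 2): F(1, 10),
     ("x1", 1): F(2, 10), ("x1", 2): F(4, 10)}

# Selection depends on the feature only: S and the class are independent given x.
phi_features_only = {("x0", 1): F(1, 2), ("x0", 2): F(1, 2),
                     ("x1", 1): F(1, 5), ("x1", 2): F(1, 5)}

# Selection depends on the class as well.
phi_class_dependent = {("x0", 1): F(1, 2), ("x0", 2): F(1, 1),
                       ("x1", 1): F(1, 4), ("x1", 2): F(1, 5)}

# Covariate shift holds exactly when the posteriors are unchanged by selection.
covariate_shift_a = posteriors(select(P, phi_features_only)) == posteriors(P)
covariate_shift_b = posteriors(select(P, phi_class_dependent)) == posteriors(P)
print(covariate_shift_a, covariate_shift_b)
```

In the first scenario the selection probability cancels in every posterior, so the posteriors coincide; in the second it does not cancel and the posteriors differ.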
In the case of general dataset shift caused by sample selection, Equation (30) does not provide information about how to compute the population posterior class probabilities $P[A_i \mid \mathcal{H}]$ from the sample posterior class probabilities $Q[A_i \mid \mathcal{H}]$. Translated into the setting of this paper, Hein [23] presented in Equation (3.2) the following two ways to do so:
• Define $Q^*$ as the distribution of the not-selected sample, i.e., $Q^*[F] = P[F \mid S^c]$ for $F \in \mathcal{F}$. Then, (31a) represents $P[A_i \mid \mathcal{H}]$ in terms of the posterior class probabilities under $Q$ and $Q^*$.
• Alternatively, (31b) represents $P[A_i \mid \mathcal{H}]$ in terms of $Q[A_i \mid \mathcal{H}]$ and the class-wise probabilities of selection $P_i[S \mid \mathcal{H}]$.

Both (31a) and (31b) are of limited practical usefulness, however. On the one hand, (31a) requires knowledge of the class labels in the not-selected sample, which usually are not available. On the other hand, for (31b) to be applicable, the class-wise probabilities of selection $P_i[S \mid \mathcal{H}]$ must be estimated, which again requires knowledge of the class labels in the not-selected sample.
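Both routes can be illustrated on a finite example using only the law of total probability and Bayes' rule. The Python sketch below rests on assumptions: the two identities coded here are elementary and consistent with the description above, but they are not claimed to reproduce the exact form of (31a) and (31b), and all numbers are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical population distribution P on atoms (feature value, class label).
P = {("x0", 1): F(3, 10), ("x0", 2): F(1, 10),
     ("x1", 1): F(2, 10), ("x1", 2): F(4, 10)}
# Hypothetical class-wise, feature-conditional selection probabilities.
phi = {("x0", 1): F(1, 2), ("x0", 2): F(1, 1),
       ("x1", 1): F(1, 4), ("x1", 2): F(1, 5)}

features = sorted({x for x, _ in P})
classes = sorted({c for _, c in P})

truth, route_a, route_b = {}, {}, {}
for x in features:
    p_x = sum(P[(x, c)] for c in classes)
    # P[S | x]: overall selection probability given the feature value.
    p_sel_x = sum(P[(x, c)] * phi[(x, c)] for c in classes) / p_x
    for c in classes:
        post_P = P[(x, c)] / p_x                                   # population posterior
        post_Q = post_P * phi[(x, c)] / p_sel_x                    # selected sample
        post_Qstar = post_P * (1 - phi[(x, c)]) / (1 - p_sel_x)    # not-selected sample
        truth[(x, c)] = post_P
        # Route A: mix the posteriors of the selected and the not-selected sample.
        route_a[(x, c)] = p_sel_x * post_Q + (1 - p_sel_x) * post_Qstar
        # Route B: rescale the sample posterior by the class-wise selection probability.
        route_b[(x, c)] = post_Q * p_sel_x / phi[(x, c)]

print(route_a == truth, route_b == truth)
```

In the sketch the sample posteriors are derived from the known population for convenience. In practice they would be estimated, and the quantities needing labels of not-selected objects (`post_Qstar` and `phi`) are exactly the ones that are hard to obtain.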

Sample Selection Bias and Factorizable Joint Shift
Proposition 6 provides an example of a condition for the sample selection process that makes the resulting bias between population and sample representable as covariate shift and, consequently, according to Section 5.2, as a special case of factorizable joint shift. Are there other selection procedures that entail factorizable joint shift?
We investigate this question by assuming that the population distribution $P$ and the sample distribution $Q$ are related by factorizable joint shift and then identifying the consequences this assumption implies for the class-wise feature-conditional selection probabilities $P_i[S \mid \mathcal{H}]$, $i = 1, \dots, d$.

Theorem 3.
Under Assumption 3 and Definition 4, let $P$ and $Q$ be related by factorizable joint shift in the sense of Definition 3, i.e., there are an $\mathcal{H}$-measurable function $g \ge 0$ and an $\mathcal{A}$-measurable function $b \ge 0$ such that the density $h$ of $Q$ with respect to $P$ on $\mathcal{H}$ can be represented as $h = g\,b$. Then, the following statements hold true:
(i) $Q$ and $P$ are related by factorizable joint shift with an $\mathcal{H}$-measurable function $g^* > 0$ and an $\mathcal{A}$-measurable function $b^* > 0$ that can be represented up to a constant factor in the sense of (8b) as in (32a), where the constants $0 < \alpha_1, \dots, \alpha_{d-1} < \infty$ satisfy the equation system (32b) for $i = 1, \dots, d-1$.
(ii) The posterior class probabilities $P[A_i \mid \mathcal{H}]$, $i = 1, \dots, d$, can be represented as in (33), where the constants $0 < \alpha_1, \dots, \alpha_{d-1} < \infty$ satisfy equation system (32b).
(iii) The class-wise feature-conditional selection probabilities $P_i[S \mid \mathcal{H}]$, $i = 1, \dots, d$, can be represented as in (34), where the constants $0 < \alpha_1, \dots, \alpha_{d-1} < \infty$ satisfy equation system (32b) and $\alpha_d = 1$.
Proof. The functions $g$ and $b$ must be positive since $h$ is positive according to Proposition 5. Hence, $Q$ and $P$ are related by factorizable joint shift with decomposition $b^* = 1/b$ and $g^* = 1/g$. Apply Theorem 2 with swapped roles of $P$ and $Q$ to obtain representation (32a) and equation system (32b). Statement (ii) follows immediately from Corollary 4. Regarding (iii), use (28) and Proposition 1 (iv) together with (32a) to obtain an equation that is equivalent to (34).
As mentioned in Section 4.1 as a potential application of Theorem  In case (a), (34) may serve as an admissibility check for the solutions found. If the class-wise selection probabilities $P_i[S \mid \mathcal{H}]$ obtained from (34) can take values greater than 100%, the corresponding set of values $(\alpha_1, \dots, \alpha_{d-1})$ is not an admissible solution of (32b). If all solutions $(\alpha_1, \dots, \alpha_{d-1})$ of (32b) turn out to be inadmissible, it must be concluded that the assumption of factorizable joint shift for the sample selection process is wrong.
In case (b), (34) implies inequality (35) for all $i, j = 1, \dots, d$. Inequality (35) provides a simple necessary criterion for the presence of factorizable joint shift with constants $\alpha_i$ all equal to 1.
A further, less obvious special case of Theorem 3 is encountered if it is assumed that

Conclusions
We revisited the notion of "factorizable joint shift" recently introduced by He et al. [3]. A main finding is that factorizable joint shift is actually not much more general than prior probability shift or covariate shift. However, in contrast to these two types of shifts, factorizable joint shift is not fully identifiable if no class label information on the test (target) dataset is available and no additional assumptions are made. These findings are based on a representation result (Theorem 2) and a comparison of the class posterior correction formula (12) for factorizable joint shift to the related correction formulae (16) and (17) for prior probability and covariate shifts, respectively. Formula (12) is structurally identical with formula (16) but includes additional constants which can be found by solving the nonlinear equation system (9c).
He et al. [3] did not present the full rationale for their joint importance aligning approach to estimating the characteristics of factorizable joint shift. Hence, solving equation system (9c) for the additional constants in the posterior correction formula or for the prior class probabilities under the target distribution can be considered attractive alternative approaches.
Some open research questions remain:
• Under what conditions can the existence and the uniqueness of solutions to equation system (9c) be guaranteed in the case of more than two classes?
• Is there any manageable (in the sense of having observable characteristics) type of dataset shift which is both more complex than factorizable joint shift and less complex than covariate shift with posterior drift?
• To what extent can Theorem 2 be adapted for a more general regression setting?
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Acknowledgments:
The author thanks three anonymous reviewers for suggestions that helped to improve an earlier version of this paper.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. The Generalized Bayes Formula
Lemma A1 is Theorem 10.8 of Klebaner [25], slightly extended to explicitly cover the case when the denominator in the formula for the density can be 0.
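Lemma A1. Let $(\Omega, \mathcal{F})$ be a measurable space and $P$ and $Q$ probability measures on $(\Omega, \mathcal{F})$. Assume that $f = \frac{dQ}{dP}$ is a density of $Q$ with respect to $P$ on $\mathcal{F}$. Let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$ and $X$ be a non-negative random variable on $(\Omega, \mathcal{F})$ or a random variable on $(\Omega, \mathcal{F})$ such that $f X$ is $P$-integrable. Then, the following two statements hold:
(i) $Q\big[E_P[f \mid \mathcal{G}] = 0\big] = 0$.
(ii) $E_Q[X \mid \mathcal{G}] = \dfrac{E_P[f X \mid \mathcal{G}]}{E_P[f \mid \mathcal{G}]}$ on the event $\{E_P[f \mid \mathcal{G}] > 0\}$.

Proof. For (i): Observe that
$$E_P\big[f\, \mathbf{1}_{\{E_P[f \mid \mathcal{G}] = 0\}}\big] = E_P\big[E_P[f \mid \mathcal{G}]\, \mathbf{1}_{\{E_P[f \mid \mathcal{G}] = 0\}}\big] = 0.$$
This implies $Q\big[E_P[f \mid \mathcal{G}] = 0\big] = 0$. For (ii): see Klebaner [25], proof of Theorem 10.8.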
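The generalised Bayes formula can be verified by hand on a finite probability space. The following Python sketch is an illustration under assumptions: it uses the standard form $E_Q[X \mid \mathcal{G}] = E_P[f X \mid \mathcal{G}] / E_P[f \mid \mathcal{G}]$, a six-point space with made-up values for $P$, $f$ and $X$, and a σ-algebra $\mathcal{G}$ generated by a partition into three blocks. One block carries density zero, which shows that a vanishing denominator can only occur on a set of $Q$-probability zero.

```python
from fractions import Fraction as F

# Hypothetical finite probability space with six equally likely points under P.
P = [F(1, 6)] * 6
# Density f = dQ/dP with E_P[f] = 1; it vanishes on the first block.
f = [F(0), F(0), F(3, 2), F(3, 2), F(1), F(2)]
X = [5, 7, 1, 3, 2, 4]
blocks = [[0, 1], [2, 3], [4, 5]]      # partition generating the sub-sigma-algebra G

Q = [f[w] * P[w] for w in range(6)]    # Q[{w}] = f(w) * P[{w}]

def cond_exp(values, measure, block):
    """Conditional expectation of `values` given `block` under `measure`."""
    mass = sum(measure[w] for w in block)
    return sum(values[w] * measure[w] for w in block) / mass

fX = [f[w] * X[w] for w in range(6)]
results = []
for block in blocks:
    denominator = cond_exp(f, P, block)            # E_P[f | G] on this block
    q_mass = sum(Q[w] for w in block)
    if denominator == 0:
        # Zero denominator: the block must have Q-probability zero.
        results.append((block, None, None, q_mass))
        continue
    lhs = cond_exp(X, Q, block)                    # E_Q[X | G]
    rhs = cond_exp(fX, P, block) / denominator     # E_P[f X | G] / E_P[f | G]
    results.append((block, lhs, rhs, q_mass))

for row in results:
    print(row)
```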

Appendix B. Proofs
Appendix B.1. Proof of Proposition 2

For a more concise notation, (9c) is rewritten as (A1a) in terms of two non-negative, $\mathcal{H}$-measurable random variables $R_1$ and $R_2$. Some algebra shows that (A1a) is equivalent to (A1b). Define the function
$$g(\varrho) = E_P\!\left[\frac{h\, R_2}{\varrho\, q\, R_1 + (1-q)\, R_2}\right] \quad \text{for } \varrho \ge 0.$$
Then, it holds that
• $g(\varrho) \le \frac{1}{1-q} < \infty$ for all $\varrho \ge 0$;
• $g(0) = \frac{1}{1-q} > 1$;
• By the dominated convergence theorem, $g$ is continuous for $0 \le \varrho < \infty$ with $\lim_{\varrho \to \infty} g(\varrho) = 0$.
By the intermediate value theorem, these properties of $g$ imply the existence of some $\varrho > 0$ with $g(\varrho) = 1$. By the equivalence of (A1a) and (A1b), the existence of a positive solution to (9c) follows.
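The existence argument also suggests a simple numerical procedure: since $g$ is continuous, exceeds 1 at zero and tends to 0, a root of $g = 1$ can be bracketed and then located by bisection. The Python sketch below is a minimal illustration under assumptions. It takes $g$ to have the form $E_P[h R_2 / (\varrho\, q R_1 + (1-q) R_2)]$ as read from the proof, and it uses made-up values for $q$ and for $h$, $R_1$, $R_2$ on a three-point feature space, because their definitions are not reproduced here. The function names are hypothetical.

```python
# Hypothetical inputs on a three-point feature space (illustration only).
p = [0.2, 0.3, 0.5]     # P-probabilities of the feature atoms
h = [0.5, 1.0, 1.2]     # density values, chosen such that E_P[h] = 1
R1 = [0.6, 1.0, 1.4]    # placeholder positive random variable R_1
R2 = [1.5, 1.0, 0.7]    # placeholder positive random variable R_2
q = 0.4                 # placeholder constant with 0 < q < 1

def g(rho):
    """Assumed form of g: E_P[h * R2 / (rho * q * R1 + (1 - q) * R2)]."""
    return sum(pi * hi * r2 / (rho * q * r1 + (1.0 - q) * r2)
               for pi, hi, r1, r2 in zip(p, h, R1, R2))

def solve_for_rho(tol=1e-12):
    """Bisection for g(rho) = 1, using g(0) > 1 and g(rho) -> 0 as rho grows."""
    lo, hi = 0.0, 1.0
    while g(hi) > 1.0:          # enlarge the bracket until g(hi) <= 1
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho_star = solve_for_rho()
print(rho_star, g(rho_star))
```

With positive $R_1$ and $R_2$, as assumed here, $g$ is strictly decreasing, so the root found is the only one. Bisection is used because it needs nothing beyond the continuity and sign change established in the proof.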

Appendix B.2. Proof of Proposition 4
Observe that function g in (23) is well-defined because the denominator on the right-hand side of the equation is always positive. On the one hand, by Corollary 2, we obtain for i = 1, . . . , d on the set {h > 0} where h i denotes the density of the target class-conditional feature distribution Q i with respect to the source class-conditional feature distribution P i on H.