Counterfactual Supervision-Based Information Bottleneck for Out-of-Distribution Generalization

Learning invariant (causal) features for out-of-distribution (OOD) generalization has attracted extensive attention recently, and among the proposals, invariant risk minimization (IRM) is a notable solution. In spite of its theoretical promise for linear regression, the challenges of using IRM in linear classification problems remain. By introducing the information bottleneck (IB) principle into the learning of IRM, the IB-IRM approach has demonstrated its power to solve these challenges. In this paper, we further improve IB-IRM from two aspects. First, we show that the key assumption of support overlap of invariant features, which IB-IRM uses to guarantee OOD generalization, is strong, and that it is still possible to achieve the optimal solution without this assumption. Second, we illustrate two failure modes in which IB-IRM (and IRM) could fail to learn the invariant features, and to address such failures, we propose a counterfactual supervision-based information bottleneck (CSIB) learning algorithm that recovers the invariant features. By requiring counterfactual inference, CSIB works even when accessing data from a single environment. Empirical experiments on several datasets verify our theoretical results.


Introduction
Modern machine learning models are prone to catastrophic performance loss during deployment when the test distribution is different from the training distribution. This phenomenon has been repeatedly witnessed and intentionally exposed in many examples [1][2][3][4][5]. Among the explanations, shortcut learning [6] is considered a main factor causing this phenomenon. A good example is the classification of images of cows and camels: a trained convolutional network tends to recognize cows or camels by learning spurious features from image backgrounds (e.g., green pastures for cows and deserts for camels), rather than learning the causal shape features of the animals [7]; decisions based on the spurious features make the learned models fail when cows or camels appear in unusual or different environments. Machine learning models are expected to have the capability of out-of-distribution (OOD) generalization and to avoid shortcut learning.
To achieve OOD generalization, recent theories [8][9][10][11][12] are motivated by causality literature [13,14] and resort to extraction of the invariant, causal features and establishing the relevant conditions under which machine learning models have guaranteed generalization. Among these works, invariant risk minimization (IRM) [8] is a notable learning paradigm that incorporates the invariance principle [15] into practice. In spite of the theoretical promise of IRM, it is only applicable to problems of linear regression. For other problems, such as linear classification, Ahuja et al. [12] first show that for OOD generalization, linear classification is more difficult (see Theorem 1) and propose a new learning method of information bottleneck-based invariant risk minimization (IB-IRM) based on the support overlap assumption (Assumption 7). In this work, we closely investigate the conditions identified in [12] and propose improved results for OOD generalization of linear classification.
Our technical contributions are as follows. In [12], a notion of support overlap of invariant features is assumed in order to make OOD generalization of linear classification successful. In this work, we first show that this assumption is strong, and that it is still possible to achieve this goal without it. Then, we examine whether the IB-IRM method proposed in [12] is sufficient to learn invariant features for linear classification and find that IB-IRM (and IRM) could fail in two modes; we analyze these failure modes, in particular the case when the spurious features in the training environments capture sufficient information for the task of interest but have less information (entropy) than the invariant features. Based on the above analyses, we propose a new method, termed counterfactual supervision-based information bottleneck (CSIB), to address such failures. We prove that, without the need for the support overlap assumption, CSIB is theoretically guaranteed to succeed in OOD generalization for linear classification. Notably, CSIB works even when accessing data from a single environment. Finally, we design three synthetic datasets and a colored MNIST dataset based on our examples; experiments demonstrate the effectiveness of CSIB empirically.
The rest of this article is organized as follows. The learning problem of out-of-distribution (OOD) generalization is formulated in Section 2. In Section 3, we study the learnability of OOD generalization with different assumptions on the training and test environments. Using these assumptions, two failure modes of previous methods (IRM and IB-IRM) are analysed in Section 4. Based on the above analysis, our method is then proposed in Section 5. The experiments are reported in Section 6. Finally, we discuss the related works in Section 7 and provide some conclusions and limitations of our work in Section 8. All the proofs and details of experiments are given in Appendices A and B.

Background on Structural Equation Models
Before introducing our formulations of OOD generalization, we provide a detailed background on structural equation models (SEMs) [8,13].
Definition 1 (Structural equation model). A structural equation model (SEM) C = (S, N) over random variables X = (X_1, . . . , X_d) is a collection of structural equations S_i : X_i ← f_i(Pa(X_i), N_i), i = 1, . . . , d, where Pa(X_i) ⊆ {X_1, . . . , X_d} \ {X_i} are called the parents of X_i, and the N_i are independent noise random variables. For every SEM, we obtain a directed acyclic graph (DAG) G by adding one vertex for each X_i and a directed edge from each parent in Pa(X_i) (the causes) to the child X_i (the effect).

Definition 2 (Intervention).
Consider an SEM C = (S, N). An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SEM C^e = (S^e, N^e) with structural equations S^e_i : X^e_i ← f^e_i(Pa^e(X^e_i), N^e_i), i = 1, . . . , d. The variable X^e_i is intervened if S_i ≠ S^e_i or N_i ≠ N^e_i.
In an SEM C, we can draw samples from the observational distribution P(X) according to the topological ordering of its DAG G. We can also manipulate (intervene on) a unique SEM C in different ways, indexed by e, to obtain different but related SEMs C^e, which results in different interventional distributions P(X^e). Such a family of interventions is used to model the environments.
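To make the sampling and intervention mechanics concrete, here is a minimal sketch (a toy two-variable model of our own, not the paper's C_ood) of drawing from an observational SEM and from an intervened copy of it:

```python
import random

# Minimal sketch of sampling from an SEM and from an intervened copy:
# variables are drawn in topological order, and an intervention e
# replaces the noise of the first structural equation by a shifted
# distribution. All names and values here are illustrative.

def sample(shift=0.0, n=1000, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(shift, 1.0)      # S_1: Z <- N_1 (intervened via `shift`)
        y = 1 if z >= 0 else 0         # S_2: Y <- 1(Z), kept fixed across e
        data.append((z, y))
    return data

obs = sample()                         # observational distribution P(Z, Y)
env_e = sample(shift=2.0)              # interventional distribution P(Z^e, Y^e)
```

Note that the conditional P(Y | Z) is unchanged by the intervention, only the marginal of Z shifts; this is the kind of valid intervention used below to generate environments.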

Formulations of OOD Generalization
In this paper, we study the OOD generalization problem by following the linear classification structural equation model below [12].

Assumption 1 (Linear classification SEM C_ood). For each environment e,

Y^e ← 1(w*_inv · Z^e_inv) ⊕ N^e,  N^e ∼ Bernoulli(q),  q < 0.5,
X^e ← S(Z^e_inv, Z^e_spu),

where w*_inv ∈ R^m is the labeling hyperplane, Z_inv ∈ R^m, Z_spu ∈ R^o, X ∈ R^d, ⊕ is the XOR operator, S ∈ R^{d×(m+o)} is invertible (d = m + o), · is the dot product, and 1(a) = 1 if a ≥ 0 and 0 otherwise.
The SEM C_ood governs four random variables {X, Y, Z_inv, Z_spu}, and its directed acyclic graph (DAG) is illustrated in Figure 1a, where the exogenous noise variable N is omitted. Following Definition 2, each intervention e generates a new environment e with interventional distribution P(X^e, Y^e, Z^e_inv, Z^e_spu). We assume only the variables X^e and Y^e are observable. In OOD generalization, we are interested in a set of environments E_all defined as below.

Figure 1. (a) DAG of the SEM C_ood (Assumption 1); (b-d) DAGs of the interventional SEM C^e_ood in the training environments E_tr with respect to different correlations between Z_inv and Z_spu. Grey nodes denote observed variables, and white nodes represent unobserved variables. Dashed lines denote the edges which might vary across the interventional environments and even be absent in some scenarios, whilst solid lines indicate that they are invariant across all the environments. All exogenous noise variables are omitted in the DAGs.

Definition 3 (E all ).
Consider the SEM C ood (Assumption 1) and the learning goal of predicting Y from X. Then, the set of all environments E all (C ood ) indexes all the interventional distributions P(X e , Y e ) obtainable by valid interventions e. An intervention e ∈ E all (C ood ) is valid as long as (i) the DAG remains acyclic, (ii) P(Y e |Z e inv ) = P(Y|Z inv ), and (iii) P(X e |Z e inv , Z e spu ) = P(X|Z inv , Z spu ).
The goal is to learn, using only data from the training environments E_tr, a predictor f that performs well over a set of OOD environments E_ood:

min_{f ∈ F} max_{e ∈ E_ood} R^e(f),   (2)

where R^e(f) := E_{X^e, Y^e}[l(f(X^e), Y^e)] is the risk under environment e, with l(·, ·) the 0-1 loss function. Since E_ood may be different from E_tr, this learning problem is called OOD generalization. We assume the predictor f = w ∘ Φ includes a feature extractor Φ : X → H and a classifier w : H → Y. With a slight abuse of notation, we also let the classifier w and feature extractor Φ be parameterized by themselves, respectively, as w ∈ R^{c+1} and Φ ∈ R^{c×d}, with c the number of feature dimensions.

Background on IRM and IB-IRM
To minimize Equation (2), two notable solutions, IRM [8] and IB-IRM [12], are listed as follows:

IRM: min_{w, Φ} (1/|E_tr|) Σ_{e ∈ E_tr} R^e(w ∘ Φ)  s.t.  w ∈ argmin_{w̄} R^e(w̄ ∘ Φ), ∀e ∈ E_tr;

IB-IRM: min_{w, Φ} Σ_{e ∈ E_tr} h^e(Φ)  s.t.  (1/|E_tr|) Σ_{e ∈ E_tr} R^e(w ∘ Φ) ≤ r_th  and  w ∈ argmin_{w̄} R^e(w̄ ∘ Φ), ∀e ∈ E_tr,

where h^e(Φ) = H(Φ(X^e)) with H the Shannon entropy (or a lower-bounded differential entropy), and r_th is the threshold on the average risk. If we drop the invariance constraint from IRM and IB-IRM, we obtain standard empirical risk minimization (ERM) and information bottleneck-based empirical risk minimization (IB-ERM), respectively. The use of an entropy constraint in IB-IRM is inspired by the information bottleneck principle [16], where the mutual information I(X; Φ(X)) is used for information compression. Since the representation Φ(X) is a deterministic mapping of X, we have I(X; Φ(X)) = H(Φ(X)); thus, minimizing the entropy of Φ(X) is equivalent to minimizing the mutual information I(X; Φ(X)). In brief, the optimization goal of IB-IRM is to select, among all highly predictive invariant predictors, the one that has the least entropy.
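As a toy illustration of the quantities appearing in these objectives, the sketch below evaluates the IB-IRM ingredients for a one-dimensional linear feature Φ(x) = φ·x and predictor f(x) = 1(Φ(x) ≥ 0), using the variance of Φ(X^e) as a crude stand-in for the entropy term h^e(Φ) (as is also done for tractability in our experiments); all data and names are our illustrative choices:

```python
import statistics

# Toy evaluation of the IB-IRM pieces for Phi(x) = phi * x.

def risk(phi, env):
    # 0-1 risk R^e of f(x) = 1(phi * x >= 0) on environment data [(x, y), ...]
    return sum(1 for x, y in env if (1 if phi * x >= 0 else 0) != y) / len(env)

def ib_irm_objective(phi, envs, r_th=0.0, lam=1.0):
    avg_risk = sum(risk(phi, e) for e in envs) / len(envs)
    # information bottleneck term: sum over environments of h^e(Phi),
    # proxied here by the population variance of Phi(X^e)
    ib = sum(statistics.pvariance([phi * x for x, _ in e]) for e in envs)
    feasible = avg_risk <= r_th            # the average-risk constraint of IB-IRM
    return ib + lam * avg_risk, feasible

env1 = [(1.0, 1), (-1.0, 0), (2.0, 1)]
env2 = [(0.5, 1), (-0.5, 0), (-2.0, 0)]
val, ok = ib_irm_objective(1.0, [env1, env2])
```

Shrinking |φ| lowers the IB term without changing the 0-1 risk here, mirroring how the bottleneck pushes the extractor toward low-entropy features among equally predictive ones.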

OOD Generalization: Assumptions and Learnability
To study the learnability of OOD generalization, we make the following definition.

Definition 4.
Given E_tr ⊂ E_all and E_ood ⊆ E_all, we say an algorithm succeeds in solving OOD generalization with respect to (E_tr, E_ood) if the predictor f* ∈ F returned by this algorithm satisfies

max_{e ∈ E_ood} R^e(f*) = min_{f ∈ F} max_{e ∈ E_ood} R^e(f),

where F is the learning hypothesis set (including all possible linear classifiers). Otherwise, we say it fails to solve OOD generalization.
So far, we have omitted how different environments of E tr and E ood exactly are to enable OOD generalization. Different assumptions about E tr and E ood make the OOD generalization problem different.

Assumptions about the Training Environments E tr
Define the support set of the invariant (resp., spurious) features Z^e_inv (resp., Z^e_spu) in environment e as Z^e_inv (resp., Z^e_spu). In general, we assume the invariant features Z^e_inv in the training environments E_tr are bounded (Assumption 2) and strictly linearly separable with respect to the labels (Assumption 3).
The difficulties of OOD generalization are due to the spurious correlations between Z_inv and Z_spu in the training environments E_tr. In this paper, we consider three modes induced by different correlations between Z_inv and Z_spu, as shown below.

Assumption 4 (Spurious correlation 1). Assume for each e ∈ E_tr, Z^e_spu ← A Z^e_inv + W^e, where A ∈ R^{o×m}, and W^e ∈ R^o is a continuous (or discrete with each component supported on at least two distinct values), bounded, and zero-mean noise variable.

Assumption 5 (Spurious correlation 2).
Assume for each e ∈ E_tr, Z^e_inv ← A Z^e_spu + W^e, where A ∈ R^{m×o}, and W^e ∈ R^m is a continuous (or discrete with each component supported on at least two distinct values), bounded, and zero-mean noise variable.

Assumption 6 (Spurious correlation 3).
Assume for each e ∈ E_tr, Z^e_spu ← Y^e W^e_1 + (1 − Y^e) W^e_0, where W^e_0 ∈ R^o and W^e_1 ∈ R^o are independent noise variables.
For each e ∈ E tr , the DAGs of its corresponding interventional SEMs C e ood with respect to Assumptions 4-6 are illustrated in Figure 1b-d, respectively. It is worth noting that although the DAGs are identical across all training environments in each mode of Assumptions 4-6, the interventional SEMs C e ood among different training environments are different due to the interventions on the exogenous noise variables.

Assumptions about the OOD Environments E ood
Theorem 1 (Impossibility of guaranteed OOD generalization for linear classification [12]). Suppose E ood = E all . If for all the training environments E tr , the latent invariant features are bounded and strictly separable, i.e., Assumptions 2 and 3 hold, then every deterministic algorithm fails to solve the OOD generalization.
The above theorem shows that it is impossible to solve OOD generalization if E_ood = E_all. To make it learnable, Ahuja et al. [12] propose the support overlap assumption on the invariant features: Assumption 7 (Invariant feature support overlap) requires that ∀e ∈ E_ood, Z^e_inv ⊆ Z^tr_inv, where Z^tr_inv = ∪_{e ∈ E_tr} Z^e_inv.
However, Assumption 7 is strong, and we show that it is still possible to solve OOD generalization without it. For illustration, consider an OOD generalization task from P(X^{e_1}, Y^{e_1}) to P(X^{e_2}, Y^{e_2}) with E_tr = {e_1} and E_ood = {e_2}; the support sets of the corresponding invariant features Z^{e_1}_inv and Z^{e_2}_inv are illustrated in Figure 2c (in this example, dim(Z_inv) = 2 and Z_inv = (Z_1, Z_2); the blue and black regions represent the support sets of Z^{e_1}_inv and Z^{e_2}_inv, corresponding to the training environment e_1 and the OOD environment e_2, respectively). From Figure 2c, it is clear that although the support sets of the invariant features differ between the two environments, it is still possible to solve OOD generalization if the learned feature extractor Φ captures only the invariant features, e.g., Φ(X) = Z_inv: although Assumption 7 does not hold in this example, any zero-error classifier with Φ(X) = Z_inv on the e_1 environment data would clearly also attain zero classification error in e_2, thus succeeding in solving OOD generalization.
To make Assumption 7 weaker, we propose the following assumption.
Assumption 8 (Zero-error classifier inclusion). Let P(Z^tr_inv, Y^tr) be the mixture distribution of invariant features and labels over the training environments. Denote by A a hypothesis set including all linear classifiers mapping from R^m to Y, and let F_l(P) ⊆ A denote the set of classifiers achieving zero expected loss l under distribution P. ∀e ∈ E_ood, assume F_l(P(Z^tr_inv, Y^tr)) ⊆ F_l(P(Z^e_inv, Y^e)), where l is the 0-1 loss function.

Clearly, under the assumption of separable invariant features (Assumption 3), for any e ∈ E_ood, Assumption 7 holds ⇒ Z^e_inv ⊆ Z^tr_inv ⇒ F_l(P(Z^tr_inv, Y^tr)) ⊆ F_l(P(Z^e_inv, Y^e)) ⇒ Assumption 8 holds, but not vice versa. Therefore, Assumption 8 is weaker than Assumption 7. We show in Section 5 that Assumption 8 can substitute for Assumption 7 in guaranteeing the success of OOD generalization with our proposed method.
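The content of this inclusion condition can be checked numerically in one dimension. Below, we enumerate threshold classifiers 1(z ≥ t) on a grid and verify that every classifier with zero 0-1 loss on the pooled training invariant features also has zero loss on an OOD environment whose support does not overlap the training support (a toy setup of ours, not the paper's):

```python
# Toy 1-D check of the zero-error-classifier inclusion: the OOD support
# is disjoint from the training support (so support overlap fails), yet
# every zero-error training classifier is also zero-error OOD.

def zero_error_thresholds(data, grid):
    # thresholds t whose classifier 1(z >= t) has zero 0-1 loss on data
    return {t for t in grid if all((1 if z >= t else 0) == y for z, y in data)}

grid = [i / 10 for i in range(-30, 31)]
train = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # pooled Z_inv over E_tr
ood = [(-2.5, 0), (2.5, 1)]                          # support disjoint from training
assumption8 = zero_error_thresholds(train, grid) <= zero_error_thresholds(ood, grid)
```

This mirrors the Figure 2c discussion: the OOD support lies outside the training support, but the set of zero-error classifiers only grows.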

Failures of IRM and IB-IRM
Under Spurious Correlation 1 (Assumption 4), the IB-IRM algorithm has been shown to enable OOD generalization, while IRM fails [12]. In this section, we show that both IRM and IB-IRM could fail under Spurious Correlations 2 and 3 (Assumptions 5 and 6).

Failure under Spurious Correlation 2
Example 1 (Counter-Example 1). Under Assumption 5, let Z^e_inv ← Z^e_spu + W^e with dim(Z^e_inv) = dim(Z^e_spu) = dim(W^e) = 1, and let w*_inv = 1 be the generating classifier in Assumption 1. We assume two training environments and an OOD environment; Figure 2a shows the support points of these features in the training environments. Then, applying any algorithm to solve the above example with r_th = q yields a predictor f* = w* ∘ Φ*. Consider the prediction made by this model (we ignore the classifier bias for convenience): ŷ^e = 1(Φ*_inv · Z^e_inv + Φ*_spu · Z^e_spu). It is trivial to show that the f* with Φ*_inv = 0 and Φ*_spu = 1 is an invariant predictor across the training environments with classification error R^{e_1} = R^{e_2} = q, and it achieves the least entropy, h^e(Φ*) = 0, for each training environment e. Therefore, it is a solution of IB-IRM and IRM. However, the predictor f* relies on spurious features and has test error R^{e_3} = 0.5; thus, it fails to solve the OOD generalization.

Failure under Spurious Correlation 3
Example 2 (Counter-Example 2). Under Assumption 6, let Z^e_inv be a discrete variable supported uniformly on the six points {−4, −3, −2, 2, 3, 4} in all environments, and let w*_inv = 1 be the generating classifier in Assumption 1. We assume two training environments and an OOD environment; Figure 2b shows the support points of these features in the training environments. Then, applying any algorithm to solve the above example with r_th = q yields a predictor f* = w* ∘ Φ*. Consider the prediction made by this model (we ignore the classifier bias for convenience): ŷ^e = 1(Φ*_inv · Z^e_inv + Φ*_spu · Z^e_spu). It is trivial to show that the f* with Φ*_inv = 0 and Φ*_spu = 1 is an invariant predictor across the training environments with classification error R^{e_1} = R^{e_2} = 0, and it achieves the least entropy, h^e(Φ*) = 1, among all highly predictive predictors for each training environment e. Therefore, it is a solution of IB-IRM and IRM. However, the predictor f* relies on spurious features and has test error R^{e_3} = 1; thus, it fails to solve the OOD generalization.

Understanding the Failures
From the above simple examples, we conclude that the invariance constraint fails to remove the spurious features because the spurious features in all training environments are strictly linearly separable by the corresponding labels. This allows the predictor to rely only on spurious features while still achieving the minimum training error and being an invariant predictor across the training environments. Since the label set in classification problems is finite (with only two values in binary classification), such a phenomenon can arise. We state this failure mode formally below.

Theorem 2.
Given any E_tr ⊂ E_all and E_ood ⊆ E_all satisfying Assumptions 2, 3, and 7, if the two sets ∪_{e ∈ E_tr} Z^e_spu(Y^e = 1) and ∪_{e ∈ E_tr} Z^e_spu(Y^e = 0) are linearly separable and H(Z^e_inv) > H(Z^e_spu) in each training environment e, then IB-IRM (and IRM, ERM, or IB-ERM) with any r_th ∈ R fails to solve the OOD generalization.
The intuition behind Theorem 2 is that when the spurious features in the training environments are linearly separable with respect to the labels, no algorithm can distinguish spurious features from invariant features. Although the assumption of linear separability of the spurious features seems strong, it easily holds in high-dimensional spaces when dim(Z_spu) is large (a common case in practice, e.g., image data). We show one case in Appendix A.3: if the number of environments satisfies |E_tr| < dim(Z_spu)/2 under Assumption 6, the spurious features in the training environments are probably separable by their labels. This is because, in o-dimensional space, any partition of o randomly drawn distinct points into two subsets is linearly separable with high probability.
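This separability claim can be made concrete: o Gaussian points in R^o are in general position almost surely, so solving the linear system w · x_i = 2y_i − 1 exactly yields a hyperplane separating any assignment of binary labels. The sketch below (our illustration, not the Appendix A.3 construction) does this with plain Gauss-Jordan elimination:

```python
import random

# For n = o points in general position in R^o, the system w . x_i = 2*y_i - 1
# has an exact solution, and that w separates ANY dichotomy of the points.

def separating_hyperplane(points, labels):
    n = len(points)
    # augmented system [X | 2y - 1], solved by Gauss-Jordan elimination
    A = [row[:] + [2 * labels[i] - 1] for i, row in enumerate(points)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))  # partial pivoting
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col and A[col][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

rng = random.Random(0)
o = 10
pts = [[rng.gauss(0.0, 1.0) for _ in range(o)] for _ in range(o)]
labs = [rng.randint(0, 1) for _ in range(o)]     # an arbitrary dichotomy
w = separating_hyperplane(pts, labs)
```

Since w · x_i = ±1 exactly (up to floating-point error), the sign of w · x_i recovers every label, regardless of how the labels were assigned.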

Counterfactual Supervision-Based Information Bottleneck
In the above analyses, we have shown two failure modes of IB-IRM and IRM for OOD generalization in the linear classification problem. The key reason for the failure is that the learned features Φ(X) rely on spurious features. To prevent such failures, we present the counterfactual supervision-based information bottleneck (CSIB) learning algorithm, which removes the spurious features progressively.
In general, the IB-ERM method is applied to extract features at the beginning of each iteration:

min_{w, Φ} Σ_{e ∈ E_tr} h^e(Φ)  s.t.  (1/|E_tr|) Σ_{e ∈ E_tr} R^e(w ∘ Φ) ≤ r_th.

Due to the information bottleneck, only a part of the information of the input X is exploited in Φ(X). If information about the spurious features Z_spu exists in the learned features Φ(X), the idea of CSIB is to drop that information while maintaining the causal information (represented by the invariant features Z_inv). However, achieving this goal faces two challenges: (1) how to determine whether Φ(X) contains spurious information about Z_spu, and (2) how to remove the information of Z_spu. Fortunately, owing to orthogonality in the linear space, it is possible to disentangle the features that are exploited by Φ(X) (denoted X_1) from the features that are not exploited by Φ(X) (denoted X_2) via singular value decomposition (SVD). Based on this, we can construct an SEM C_new governing the three variables X_1, X_2, and X. By conducting counterfactual interventions on X_1 and X_2 in C_new, we can solve the first challenge by requiring a single supervision on the counterfactual examples X′. For example, if we intervene on X_1 and find that the causal information remains in the resulting X′, then the extracted features Φ(X) are definitely spurious. To address the second challenge, we replace the input with X_2, thereby filtering out the information of X_1, and conduct the same learning procedure from the beginning.
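The SVD-based disentangling step can be sketched as follows for a linear feature extractor Φ ∈ R^{c×d}: the row space of Φ gives the exploited component X_1 and its orthogonal complement gives the unexploited component X_2 (a minimal sketch; variable names are ours):

```python
import numpy as np

# Split the input space into the subspace Phi actually uses (its row
# space, giving X_1) and the orthogonal complement (its null space,
# giving X_2), via SVD of Phi.

def split_by_phi(Phi, X):
    # rows of X are input samples in R^d
    U, s, Vt = np.linalg.svd(Phi)
    r = int(np.sum(s > 1e-10))             # numerical rank of Phi
    V_used, V_rest = Vt[:r].T, Vt[r:].T    # orthonormal bases of row/null space
    X1 = X @ V_used @ V_used.T             # component of X exploited by Phi
    X2 = X @ V_rest @ V_rest.T             # component invisible to Phi
    return X1, X2

Phi = np.array([[1.0, 0.0, 0.0]])          # extractor that reads only x[0]
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X1, X2 = split_by_phi(Phi, X)
```

By construction X_1 + X_2 = X and Φ(X_2) = 0, which is exactly the orthogonality used to intervene on one component without touching the other.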
The learning algorithm of CSIB is given in Algorithm 1, and Figure 3 shows the framework of CSIB. We show in Theorem 3 that CSIB is theoretically guaranteed to solve OOD generalization.

Theorem 3 (Guarantee of CSIB). Given any E_tr ⊂ E_all and E_ood ⊆ E_all satisfying Assumptions 2, 3, and 8, then for every spurious correlation mode of Assumptions 4, 5, and 6 (in this correlation mode, assume the spurious features in the training environments are linearly separable), the CSIB algorithm with r_th = q succeeds in solving the OOD generalization.

Algorithm 1 (CSIB; the full pseudo-code listing is given in the paper). In outline, CSIB iterates two steps: Step 1, run IB-ERM on the (filtered) input to obtain (w*, Φ*) and split the input via SVD into the exploited part x_1 and the unexploited part x_2; Step 2, if the counterfactual supervision gives label(x_1) = label(x_2), record the rank r and the matrix V^T (Lr.append(r); Lv.append(V^T)), filter the exploited directions out of the input, and return to Step 1; otherwise, stop and return w ← w*, Φ ← Φ*.

Remark 1. CSIB succeeds in solving OOD generalization without assuming support overlap of the invariant features and applies to multiple spurious correlation modes where IB-IRM (as well as ERM, IRM, and IB-ERM) may fail. By introducing counterfactual inference with a few steps of further supervision (usually conducted by a human), CSIB works even when accessing data from a single environment, which is significant especially when data from multiple environments are not available.
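The loop structure of CSIB can be sketched at a high level as follows; `toy_ib_erm`, `toy_intervene`, and the labeling oracle are illustrative stubs of ours, not the paper's implementation, and the entropy minimization of IB-ERM is caricatured as picking the single least-variance coordinate:

```python
import numpy as np

# High-level sketch of the CSIB loop: run IB-ERM, intervene on the
# exploited component, query a counterfactual label, and filter the
# exploited subspace out whenever the label does not change.

def csib(X, y, ib_erm, intervene, oracle_label, max_iters=5):
    proj = np.eye(X.shape[1])              # accumulated filter on the input space
    w, Phi = None, None
    for _ in range(max_iters):
        w, Phi = ib_erm(X @ proj, y)       # Step 1: IB-ERM on the filtered input
        x, x_cf = intervene(Phi, X @ proj) # counterfactual pair intervening on X_1
        if oracle_label(x) != oracle_label(x_cf):
            return w, Phi, proj            # label changed: Phi captured causal features
        # label unchanged: Phi used only spurious features, filter its subspace out
        _, s, Vt = np.linalg.svd(Phi)
        r = int(np.sum(s > 1e-10))
        proj = proj @ Vt[r:].T @ Vt[r:]
    return w, Phi, proj

def toy_ib_erm(Xp, y):
    # stub with the IB bias: pick the single coordinate of least nonzero variance
    var = Xp.var(axis=0)
    var[var < 1e-12] = np.inf
    j = int(np.argmin(var))
    Phi = np.zeros((1, Xp.shape[1]))
    Phi[0, j] = 1.0
    return None, Phi

def toy_intervene(Phi, Xp):
    x = Xp[0].copy()
    d = Phi[0] / np.linalg.norm(Phi[0])
    return x, x - 2 * (x @ d) * d          # flip the exploited component

oracle = lambda x: 1 if x[1] >= 0 else 0   # ground truth: coordinate 1 is causal
X = np.array([[0.1, 1.0], [-0.1, -1.0], [0.1, 2.0], [-0.1, -2.0]])
y = np.array([1, 0, 1, 0])
w, Phi, proj = csib(X, y, toy_ib_erm, toy_intervene, oracle)
```

In this toy run, the first pass latches onto the low-variance spurious coordinate 0, the counterfactual label does not change, so it is filtered out; the second pass then recovers the causal coordinate 1.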

Toy Experiments on Synthetic Datasets
We perform experiments on three synthetic datasets covering different spurious correlation modes to verify our method, counterfactual supervision-based information bottleneck (CSIB), and compare it to ERM, IB-ERM, IRM, and IB-IRM. We follow the same protocol for tuning hyperparameters as [8,12,17] and report the classification error for all experiments. In the following, we first briefly describe the designed datasets and then report the main results. More experimental details can be found in the Appendix.

Datasets
Example 1/1S. The example is a modified one from the linear unit tests introduced in [17], which generalizes the cow/camel classification task with relevant backgrounds.
The dataset D^e of each environment e ∈ E_tr is sampled with parameters s^e_0 = 0.5, s^e_1 = 0.7, and s^e_2 = 0.3 for the first three environments, and s^e_j ∼ Uniform(0.3, 0.7) for j > 3. The scrambling matrix S is the identity matrix in Example 1 and a random unitary matrix in Example 1S. Here, we set p^e = 1 and q = 0 for all environments to make the spurious features and the invariant features both linearly separable, so that they confuse each other. Experiments with different values of q and p^e are presented in the Appendix, where we observe interesting phenomena related to the inductive bias of neural networks.

Example 2/2S. This example extends Example 1 to show one of the failure modes of IB-IRM (as well as ERM, IRM, and IB-ERM) and how our method improves upon it by intervention (counterfactual supervision). Given w^e ∈ R, each instance in the environment data D^e is sampled with m = o = 5, where A ∈ R^{m×o} is the identity matrix in our experiments. We set w^e_0 = 3, w^e_1 = 2, w^e_2 = 1, and w^e_j ∼ Uniform(0, 3) for j > 3 across the training environments. In this example, the spurious features have clearly smaller entropy than the invariant features, which is the opposite of Example 1/1S.

Example 3/3S. This example is similar in construction to Example 2/2S. Let w^e ∼ Uniform(0, 1) for the training environments; each instance in environment e is sampled with m = o = 5 in our experiments. As in Example 2/2S, the spurious features have smaller entropy than the invariant features, but the invariant features enjoy a much larger margin than the spurious features, which is very different from the above two examples. Table 1 summarizes the properties of the three datasets.

Table 1. Summary of three synthetic datasets.
Note that for linearly separable features, the margin levels significantly influence the final learned classifier due to the implicit bias of gradient descent [18]; such bias pushes standard learning (e.g., with the cross-entropy loss) to focus more on large-margin features. The margin with respect to a dataset (or features) Z (where each instance has label 0 or 1) is the minimum distance between a point in Z and the max-margin hyperplane separating Z by its labels. From the results, we can see that although ERM achieves reasonable results due to the significantly larger margin of the invariant features, our CSIB method still improves upon it by removing more spurious features. Notably, compared with IB-ERM and IB-IRM when only spurious features are extracted (Example 2/2S, Example 3/3S), our CSIB method can effectively remove them by counterfactual supervision and then refocus on the invariant features. The non-zero average error and the fluctuating results of CSIB in some experiments arise because the entropy minimization in the training process is inexact, as entropy is substituted by variance for ease of optimization. Nevertheless, there always exists a run in which the entropy is truly minimized and the error reaches zero (see (min) in the table) in Example 2/2S and Example 3/3S. In summary, CSIB consistently performs better across different spurious correlation modes and is especially more effective than IB-ERM and IB-IRM when the spurious features have much smaller entropy than the invariant features.

Experiments on Color MNIST Dataset
In this experiment, we set up a binary classification task for digit recognition: identifying whether the digit is less than five. We use a real-world dataset, the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/), for the construction. Following our learning setting, we use color information as the spurious feature that correlates strongly with the class label. By construction, the label is more strongly correlated with the color than with the digit in the training environments, but this correlation is broken in the test environment. Specifically, the three designed environments (two training environments and one test environment, each containing 10,000 points) of the color MNIST are built as follows. First, we define a preliminary binary label ŷ based on the digit: ŷ = 0 for digits 0-4 and ŷ = 1 for digits 5-9. Second, we obtain the final label y by flipping ŷ with probability 0.25. Then, we flip the final labels to obtain the color id, where the flipping probabilities for the two training environments and the test environment are 0.2, 0.1, and 0.9, respectively. For better understanding, we randomly draw 20 examples for each label from each environment and visualize them in Figure 4.

The classification results on the color MNIST dataset are shown in Table 3. From the results, we can see that both the ERM and IB-ERM methods almost surely use the color features to solve the task. Although the IRM and IB-IRM methods show some improvements over ERM, only our method performs better than a random prediction, which demonstrates the effectiveness of CSIB.
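The label/color construction described above (omitting the digit images themselves; function and variable names are ours) can be sketched as:

```python
import random

# Sketch of the color MNIST labeling: y_hat from the digit, flip it with
# probability 0.25 to get the final label y, then flip y with an
# environment-specific probability to get the spurious color id.

def colorize(digits, flip_color_p, seed=0):
    rng = random.Random(seed)
    out = []
    for d in digits:
        y_hat = 0 if d <= 4 else 1                  # preliminary label from the digit
        y = y_hat ^ (rng.random() < 0.25)           # noisy final label
        color = y ^ (rng.random() < flip_color_p)   # spuriously correlated color
        out.append((d, y, color))
    return out

train_e1 = colorize(range(10), flip_color_p=0.2)    # first training environment
test_env = colorize(range(10), flip_color_p=0.9)    # test environment
```

With flip probability 0.9 in the test environment, the color-label correlation is nearly reversed, so any predictor relying on color fares worse than chance.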

Related Works
We divide the works related to OOD generalization into two categories: theory and methods, though some of them belong to both.

Theory of OOD Generalization
Based on different definitions to the distributional changes, we review the corresponding theory by the following three categories.
Based on causality. Due to the close connection between distributional changes and interventions discussed in the theory of causality [13,14], the problem of OOD generalization is usually framed within causal learning. The theory states that a response Y is directly caused only by its parent variables X_Pa(Y), and all interventions other than those on Y do not change the conditional distribution P(Y|X_Pa(Y)). This inspires a popular learning principle, the invariance principle, which aims to discover a set of variables whose relation to the response Y remains invariant across all observed environments [15,19,20]. Invariant risk minimization (IRM) [8] was then proposed to learn a feature extractor Φ in an end-to-end way such that the optimal classifier on top of the extracted features Φ(X) remains unchanged in each environment. The theory in [8] establishes guarantees of IRM for OOD generalization under some general assumptions but only covers linear regression tasks. Different from the failure analyses of IRM for classification tasks in [21,22], where the response Y is the cause of the spurious features, Ahuja et al. [12] analyse another scenario, in which the invariant features cause the spurious features, and show that in this case linear classification is more difficult than linear regression: the invariance principle alone is insufficient to ensure the success of OOD generalization. They also claim that the assumption of support overlap of the invariant features is necessary. They then propose the learning principle of information bottleneck-based invariant risk minimization (IB-IRM) for linear classification, which addresses the failures of IRM by adding the information bottleneck [16] into the learning. In this work, we closely investigate the conditions identified in [12] and first show that support overlap of the invariant features is not necessary for the success of OOD generalization.
We further show several failure cases of IB-IRM and propose improved results for it.
Recently, some works tackle the challenge of OOD generalization in the nonlinear regime [23,24]. Commonly, both use variational autoencoder (VAE)-based models [25,26] to identify the latent variables from observations in a first stage. These inferred latent variables are then separated into two distinct parts, invariant (causal) and spurious (non-causal) features, based on different assumptions. Specifically, Lu et al. [23,27] assume that the latent variables, conditioned on some accessible side information such as the environment index or class label, follow exponential family distributions, and Liu et al. [24] directly disentangle the latent variables into two parts during the inference stage and assume that their marginal distributions are independent of each other. These assumptions, however, are rather strong in general. Moreover, these solutions aim to capture latent variables such that the response given these variables is invariant across environments, which could still fail because the invariance principle itself is insufficient for OOD generalization in classification tasks, as shown in [12]. In this work, we focus on linear classification and develop a new theory and a new method that address several OOD generalization failures in the linear setting. Our method could extend to the nonlinear regime by combining it with disentangled representation learning [28] or causal representation learning [29]: once the latent representations are well disentangled, i.e., the latent features are a linear transform of the causal and spurious features, we can apply our method to filter out the spurious features in the latent space so that only causal features remain.
Based on robustness. Different from causality-based methods, where different distributions are generated by interventions on the same SEM and the goal is to discover causal features, robustness-based methods aim to protect the model against potential distributional shifts within an uncertainty set, which is usually constrained by an f-divergence [30] or the Wasserstein distance [31]. This line of work is theoretically grounded in distributionally robust optimization (DRO) under a minimax framework [32,33]. Recently, some works have explored the connections between causality and robustness [34]. Although these works are less relevant to ours, it is possible that a well-defined measure of distribution divergence could help to effectively extract causal features under the robustness framework. This would be an interesting avenue for future research.
Others. Some other works assume that the distributions (domains) are generated from a hyper-distribution and aim to minimize a bound on the average risk estimation error [35][36][37]. These works are often built on generalization theory under the independent and identically distributed (IID) assumption. The authors of [38] make no assumption on the distributional changes and only study the learnability of OOD generalization in a general way. None of these theories covers the OOD generalization problem under a single training environment or domain.

Methods of OOD Generalization
Based on the invariance principle. Inspired by the invariance principle [15,19], many methods design various losses to extract features that better satisfy the principle itself. IRMv1 [8] is the first objective to address this in an end-to-end way by adding a gradient penalty on the classifier. Following this work, Krueger et al. [9] suggest penalizing the variance of the risks, while Xie et al. [39] arrive at the same objective but take the square root of the variance; many other alternatives can also be found [40][41][42]. All of these methods aim to find an invariant predictor. Recently, Ahuja et al. [12] found that, for classification, finding an invariant predictor is not enough to extract causal features, since the features may include spurious information and still make the predictor invariant across training environments; they propose IB-IRM to address this failure. Ideas similar to IB-IRM can also be found in [43,44], where different loss functions are proposed to the same end. Specifically, Alesiani et al. [44] also use the information bottleneck (IB) to help drop spurious correlations, but their analyses only cover the scenario in which the spurious features are independent of the causal features, which can be considered a special case of ours. More recently, Wang et al. [45] propose ideas similar to ours but only tackle the situation in which the invariant features have the same distribution across all environments. In this work, we further show that IB-IRM can still fail in two cases, because the model may rely on spurious features alone to meet the task of interest. We then propose a counterfactual supervision-based information bottleneck (CSIB) method to address such failures and show improved results over prior works.
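To make the contrast concrete, the IRMv1 penalty from [8] can be sketched for the binary case with a scalar dummy classifier. The NumPy helper below is our own minimal illustration; the function names, the logistic-loss setup, and evaluating the gradient at a dummy classifier fixed to 1 are assumptions of this sketch, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irmv1_penalty(logits, y):
    """IRMv1-style penalty for one environment: squared gradient of the
    logistic risk w.r.t. a scalar dummy classifier w, evaluated at w = 1.
    `y` holds labels in {-1, +1}; `logits` are the raw feature scores."""
    # d/dw E[log(1 + exp(-y * w * f))] at w = 1  ->  E[-y * f * sigmoid(-y * f)]
    grad = np.mean(-y * logits * sigmoid(-y * logits))
    return grad ** 2

def irm_objective(env_batches, lam=1.0):
    """Sum over environments of risk + lam * gradient penalty."""
    total = 0.0
    for logits, y in env_batches:
        risk = np.mean(np.log1p(np.exp(-y * logits)))
        total += risk + lam * irmv1_penalty(logits, y)
    return total
```

The penalty vanishes exactly when the per-environment risk is stationary in the dummy classifier, which is what "the optimal classifier is the same in every environment" demands.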
Based on distribution matching. It is worth noting that many works focus on learning domain-invariant feature representations [46][47][48]. Most are inspired by the seminal theory of domain adaptation [49,50]. The goal of these methods is to learn a feature extractor Φ such that the marginal distribution P(Φ(X)) or the conditional distribution P(Φ(X)|Y) is invariant across domains. This differs from the invariance principle, whose goal is to make P(Y|Φ(X)) (or E(Y|Φ(X))) invariant. We refer readers to [8,51] for details on why these distribution-matching methods often fail to address OOD generalization.
Others. Other related methods are varied, including data augmentation at the image level [52] or feature level [53], removing spurious correlations through stable learning [54], and utilizing the inductive bias of neural networks [3,55]. Most of these methods are empirically motivated and are validated on specific datasets. Recently, empirical studies [56,57] noticed that the real effects of many OOD generalization (domain generalization) methods are weak, which indicates that benchmark-based evaluation criteria may be inadequate for validating OOD generalization algorithms.

Conclusions, Limitations and Future Work
In this paper, we focus on the OOD generalization problem of linear classification. We first revisit the fundamental assumptions and results of prior works, show that the condition of support overlap of the invariant features is not necessarily needed for the success of OOD generalization, and propose a weaker counterpart. Then, we present two failure cases of IB-IRM (as well as ERM, IB-ERM, and IRM) and illustrate their intrinsic causes through theoretical analysis. We further propose a new method-counterfactual supervision-based information bottleneck (CSIB)-and theoretically prove its effectiveness under some weaker assumptions. CSIB works even when accessing data from a single environment and easily extends to multi-class problems. Finally, we design several synthetic datasets based on our examples for experimental verification. Empirical observations across all compared methods illustrate the effectiveness of CSIB.
Since we only consider the linear problem, including a linear representation and a linear classifier, nonlinear cases are not covered by our theoretical results, and thus CSIB may fail there. Therefore, as with prior works (IRM [8] and IB-IRM [12]), the nonlinear challenge remains an unsolved problem [21,22]. We believe this is of great value for future work, since data in the wild are typically nonlinearly generated. Another fruitful direction is to design a powerful algorithm for entropy minimization during the learning process of CSIB. Currently, we replace the entropy of the features with their variance during optimization; however, variance and entropy are essentially different, and a truly effective entropy minimization is key to the success of CSIB. Another limitation of our method is that it requires additional supervision on the counterfactual examples during learning, although this is needed only once, for a single step.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

OOD   Out-of-distribution
IRM   Invariant risk minimization
IB    Information bottleneck
ERM   Empirical risk minimization
CSIB  Counterfactual supervision-based information bottleneck
VAE   Variational autoencoder
DRO   Distributionally robust optimization
SEM   Structural equation model
IID   Independent and identically distributed
SVD   Singular value decomposition

Appendix A. Experiments Details
In this section, we provide more details on the experiments. The code to reproduce the experiments can be found at https://github.com/szubing/CSIB.

Appendix A.1. Optimization Loss of IB-ERM
The objective function of IB-ERM is as follows:

min_{Φ, w} Σ_{e ∈ E_tr} h^e(Φ)   s.t.   (1/|E_tr|) Σ_{e ∈ E_tr} R^e(w ∘ Φ) ≤ r^th,   (A1)

where h^e(Φ) = H(Φ(X^e)). Since the entropy H(Φ(X^e)) is hard to estimate with a differentiable quantity that can be optimized by gradient descent, we follow [12] and use the variance instead of the entropy during optimization. The total loss function is given by

ℓ = (1/|E_tr|) Σ_{e ∈ E_tr} [R^e(w ∘ Φ) + λ Var(Φ(X^e))],   (A2)

with a hyperparameter λ.
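A minimal NumPy sketch of this variance-penalized loss (our own illustration; the array shapes and names are hypothetical):

```python
import numpy as np

def ib_erm_loss(features_per_env, logits_per_env, labels_per_env, lam=0.1):
    """Average per-environment logistic risk plus lam * Var(Phi(X^e)),
    with the variance standing in for the entropy H(Phi(X^e))."""
    losses = []
    for feats, logits, y in zip(features_per_env, logits_per_env, labels_per_env):
        risk = np.mean(np.log1p(np.exp(-y * logits)))  # binary cross-entropy, y in {-1, +1}
        var_penalty = feats.var(axis=0).sum()          # per-dimension variance, summed
        losses.append(risk + lam * var_penalty)
    return float(np.mean(losses))
```

Setting lam = 0 recovers plain ERM averaged over environments.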

Appendix A.2. Experiments Setup
Model, hyperparameters, loss, and evaluation. In all experiments, we follow the same protocol as prescribed by [12,17] for model/hyperparameter selection, training, and evaluation. Unless otherwise specified, for all experiments across the three examples and five compared methods, the model is the same: a linear feature extractor Φ ∈ R^{d×d} followed by a linear classifier w ∈ R^{d+1}. We use the binary cross-entropy loss for classification. All hyperparameters, including the learning rate, the penalty term in IRM, and the λ associated with Var(Φ) in Equation (A2), are randomly searched and selected using 20 test samples for validation. The results reported in the main manuscript use three hyperparameter queries each and are averaged over five data seeds. Results from searching over more hyperparameter values are reported in the supplementary experiments. The search spaces of all hyperparameters follow [12,17]. Classification test errors between 0 and 1 are reported.
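The selection protocol amounts to a plain random search; it can be sketched as follows (a toy version: `train_eval` and the search space below are illustrative placeholders, not the actual routine or spaces of [12,17]):

```python
import random

def random_search(train_eval, space, n_queries=3, seed=0):
    """Return the hyperparameter draw with the lowest validation error.
    `train_eval` maps a hyperparameter dict to a validation error; `space`
    maps each hyperparameter name to a list of candidate values."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in space.items()}
                  for _ in range(n_queries)]
    return min(candidates, key=train_eval)

# toy usage with a made-up objective: prefer lr near 0.01 and small lam
space = {"lr": [0.1, 0.01, 0.001], "lam": [0.0, 0.1, 1.0]}
best = random_search(lambda h: abs(h["lr"] - 0.01) + h["lam"], space, n_queries=5)
```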
Compute description. Our computing resource is one GPU of NVIDIA GeForce GTX 1080 Ti with 6 CPU cores of Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.

Appendix A.3. Supplementary Experiments
The purpose of the first supplementary experiment is to illustrate how the results change when we increase the number of hyperparameter queries during hyperparameter selection. These results are shown in Table A1, where we increase the number of hyperparameter queries to 10 each. Overall, the results of CSIB in Table A1 are much better and fluctuate less than those in Table 2, and the conclusions remain almost the same as summarized in Section 6.1.2. This further verifies the effectiveness of the CSIB method.
Observation on different settings in Example 1/1S. In our main experiments on Example 1/1S, we set p^e = 1 and q = 0 so that the spurious features and the invariant features are both linearly separable and thus confusable with each other. Here, we analyse what happens when we vary their values. Following [17], we set p^{e_0} = 0.95, p^{e_1} = 0.97, p^{e_2} = 0.99, and p^{e_j} ∼ Uniform(0.9, 1) to make the spurious features linearly inseparable, and q is set to 0/0.05 to make the invariant features linearly separable/inseparable. Table A2 shows the corresponding results. Interestingly, we find that all methods except IB-IRM reach an ideal error rate (the same as the Oracle) when the spurious features are linearly inseparable (p^e < 1), even when the invariant features are linearly inseparable too (q = 0.05). Why would this happen? We then remove the linear embedding Φ. The results are presented in Table A3. Comparing Tables A2 and A3, we find a significant inductive bias of the neural network, even though the model is linear. Further analysis of this observation is beyond the scope of this paper, but it would be an interesting avenue for future research.

[Rows of Table A3, flattened during extraction: for Example 1/1S with 1, 3, and 6 training environments (q = 0), most entries are 0.00 ± 0.00, with errors up to 0.31 ± 0.20 for one of the compared methods.]

Then, we look back at Theorem 2. For real data, such as images, the dimension o of the spurious features is often high.
Assume that different environments take different spurious points at random; then, from the above observation, the following event occurs with high probability: for any labeling of the data in the n training environments with n < o/2 (the factor 2 is due to the binary label), models can achieve zero training error by relying on spurious features only. This illustrates why prior methods easily fail to address OOD generalization under Assumption 6.
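This counting argument is easy to check numerically: with 2n labeled spurious points in R^o and 2n ≤ o, a generic linear system is exactly solvable, so spurious features alone fit the training labels (a toy sketch of our own, with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
o, n_envs = 12, 4                       # spurious dimension o, n_envs < o/2 environments
# one random spurious point per (environment, label): 2*n_envs points in R^o
Z = rng.normal(size=(2 * n_envs, o))
y = np.array([-1.0, 1.0] * n_envs)      # binary labels in {-1, +1}

# with 2*n_envs <= o the system Z w = y is generically solvable exactly, so a
# classifier that sees only spurious features attains zero training error
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
train_err = np.mean(np.sign(Z @ w) != y)
```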

Appendix B. Proofs
Appendix B.1. Preliminary

Before our proofs, we first review some useful properties related to entropy [12,58].

Entropy. For a discrete random variable X ∼ P_X with support X, its entropy (Shannon entropy) is defined as

H(X) = −Σ_{x ∈ X} P_X(x) log P_X(x).

The differential entropy of a continuous random variable X ∼ P_X with support X is given by

h(X) = −∫_X p_X(x) log p_X(x) dx,

where p_X(x) is the probability density function of the distribution P_X. In the following, we may write H(X) or h(X) for the entropy regardless of whether X is discrete or continuous.
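For concreteness, the discrete definition translates directly into code (a small helper for illustration only):

```python
import numpy as np

def shannon_entropy(pmf):
    """H(X) = -sum_x P_X(x) * log P_X(x), with the convention 0 log 0 = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]   # drop zero-probability atoms
    return float(-np.sum(p * np.log(p)))
```

For a fair coin this gives log 2, and a point mass has zero entropy.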
Lemma A1. If X and Y are independent discrete random variables, then

H(X + Y) ≥ max{H(X), H(Y)}.

Proof. Let Z = X + Y. Then

H(Z) = H(Z|Y) + I(Z; Y) ≥ H(Z|Y),   (A7)

and since X and Y are independent, H(Z|Y) = H(X + Y|Y) = H(X|Y) = H(X). Therefore, H(X + Y) ≥ H(X), and by symmetry H(X + Y) ≥ H(Y). This completes the proof.
Lemma A2. If X and Y are independent continuous random variables, then

h(X + Y) ≥ max{h(X), h(Y)}.

Proof. Let Z = X + Y. Then

h(Z) = h(Z|Y) + I(Z; Y) ≥ h(Z|Y),   (A10)

and since X and Y are independent, h(Z|Y) = h(X + Y|Y) = h(X|Y) = h(X). Therefore, h(X + Y) ≥ h(X), and by symmetry h(X + Y) ≥ h(Y). This completes the proof.
Lemma A3. If X and Y are independent discrete random variables whose supports satisfy 2 ≤ |X| < ∞ and 2 ≤ |Y| < ∞, then

H(X + Y) > max{H(X), H(Y)}.

Proof. By Lemma A1 and the symmetry of X and Y, we only need to prove H(X + Y) ≠ H(X). The proof is by contradiction. Suppose H(X + Y) = H(X); then from Equation (A7) it follows that I(X + Y; Y) = 0, and thus X + Y ⊥ Y. However, letting x_max and y_max denote the maximal points of the supports of X and Y, we have P(Y = y_max | X + Y = x_max + y_max) = 1, which differs from P(Y = y_max) < 1 (due to |Y| ≥ 2). This contradicts X + Y ⊥ Y.
Lemma A4. If X and Y are independent continuous random variables with bounded supports, then h(X + Y) > max{h(X), h(Y)}.
Proof. By Lemma A2 and the symmetry of X and Y, we only need to prove h(X + Y) ≠ h(X). The proof is by contradiction. Suppose h(X + Y) = h(X); then from Equation (A10) it follows that I(X + Y; Y) = 0, and thus X + Y ⊥ Y. Let x_max and y_max be the suprema of the supports of X and Y. For any δ > 0, define the event M: x_max + y_max − δ ≤ X + Y ≤ x_max + y_max. If M occurs, then Y ≥ y_max − δ and X ≥ x_max − δ. Thus, P_Y(Y ≤ y_max − δ | M) = 0. However, we can always choose δ > 0 small enough that P_Y(Y ≤ y_max − δ) > 0. This contradicts X + Y ⊥ Y.
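The strict inequality of Lemma A3 can be sanity-checked numerically by convolving two probability mass functions (an illustrative check, not part of the proof):

```python
import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# independent discrete X and Y, each with support size 2
pX = np.array([0.5, 0.5])   # X uniform on {0, 1}
pY = np.array([0.3, 0.7])   # Y on {0, 1}
pZ = np.convolve(pX, pY)    # pmf of Z = X + Y on {0, 1, 2}

strict = entropy(pZ) > max(entropy(pX), entropy(pY))
```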
Appendix B.2. Proof of Theorem 2

Proof. The proof is straightforward. Since the two sets ∪_{e ∈ E_tr} Z^e_spu(Y^e = 1) and ∪_{e ∈ E_tr} Z^e_spu(Y^e = 0) are linearly separable, there exists a linear classifier w that relies only on spurious features and achieves zero classification error in every environment. Therefore, w is an invariant predictor across the training environments. In addition, H(Z^e_inv) > H(Z^e_spu) makes IB-IRM prefer these spurious features. Therefore, w is an optimal solution of IB-IRM, ERM, IRM, and IB-ERM. However, since w relies on spurious features, which may change arbitrarily in unseen environments, it fails to solve OOD generalization.
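This failure mode can be simulated in a few lines: a least-squares classifier trained where the spurious feature is perfectly separable puts all its weight on that feature and fails once the correlation flips at test time (our own toy construction, not the paper's Example 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
z_inv = y + 0.8 * rng.normal(size=n)   # invariant feature: causal but noisy
z_spu = 2.0 * y                        # spurious feature: perfectly separable in training
X_train = np.stack([z_inv, z_spu], axis=1)

# least squares puts all weight on the spurious coordinate (an exact fit exists there)
w, *_ = np.linalg.lstsq(X_train, y, rcond=None)
train_err = np.mean(np.sign(X_train @ w) != y)

# at test time the spurious correlation is flipped
X_test = np.stack([z_inv, -2.0 * y], axis=1)
test_err = np.mean(np.sign(X_test @ w) != y)
```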

Appendix B.3. Proof of Theorem 3
Proof. Assume Φ* ∈ R^{c×d} and w* are the feature extractor and classifier learned by IB-ERM. Consider the feature variable extracted by Φ*:

Φ* X^e = Φ* S [Z^e_inv; Z^e_spu] = Φ_inv Z^e_inv + Φ_spu Z^e_spu,   where [Φ_inv, Φ_spu] = Φ* S.

We first show that Φ_inv = 0 or Φ_spu = 0. We prove this by contradiction: assume Φ_inv ≠ 0 and Φ_spu ≠ 0. Observe that the solution Φ_inv = I, Φ_spu = 0, w* = w*_inv attains average training error q; therefore, any solution returned by IB-ERM must also achieve error no larger than q (because r^th = q in the constraint of Equation (12)). Therefore, w* ≠ 0.

1. In the case when each e ∈ E_tr follows Assumption 4, i.e., Z^e_spu ← A Z^e_inv + W^e, we have

Φ* X^e = (Φ_inv + Φ_spu A) Z^e_inv + Φ_spu W^e.

Then, for any z = (z^e_inv, z^e_spu) with 1(w*_inv · z^e_inv) = 1, we must have w* · (Φ_inv + Φ_spu A) z^e_inv + w* · Φ_spu w^e ≥ 0 for any w^e, to keep the error no larger than q. Since W^e is zero mean with at least two distinct points in each component, we conclude that w* · (Φ_inv + Φ_spu A) z^e_inv ≥ 0. Similarly, for any z = (z^e_inv, z^e_spu) with 1(w*_inv · z^e_inv) = 0, we have w* · (Φ_inv + Φ_spu A) z^e_inv < 0. From Lemma A3 or Lemma A4, we obtain H((Φ_inv + Φ_spu A) Z^e_inv + Φ_spu W^e) > H((Φ_inv + Φ_spu A) Z^e_inv). Therefore, there exists a better solution of IB-ERM with zero weight on Z^e_spu, which contradicts the assumption.
2. In the case when each e ∈ E_tr follows Z^e_inv ← A Z^e_spu + W^e, we have

Φ* X^e = (Φ_spu + Φ_inv A) Z^e_spu + Φ_inv W^e.

From Lemma A3 or Lemma A4, we obtain H((Φ_spu + Φ_inv A) Z^e_spu + Φ_inv W^e) > H((Φ_spu + Φ_inv A) Z^e_spu). In addition, the spurious features are assumed to be linearly separable. Therefore, there exists a better solution of IB-ERM with zero weight on Z^e_inv, which contradicts the assumption.
3. In the case when each e ∈ E_tr follows Z^e_spu ← W^e_1 Y^e + W^e_0 (1 − Y^e), we have

Φ* X^e = Φ_inv Z^e_inv + Φ_spu W^e_1 Y^e + Φ_spu W^e_0 (1 − Y^e).

Then, for any z = (z^e_inv, z^e_spu) with 1(w*_inv · z^e_inv) = 1, we must have w* · Φ_inv z^e_inv + w* · Φ_spu w^e_1 y^e + w* · Φ_spu w^e_0 (1 − y^e) ≥ 0 for any w^e_1 and w^e_0, to keep the error no larger than q. Since W^e_1 and W^e_0 are both zero-mean variables with at least two distinct points in each component, we conclude that w* · Φ_inv z^e_inv ≥ 0; similarly, for any z = (z^e_inv, z^e_spu) with 1(w*_inv · z^e_inv) = 0, we have w* · Φ_inv z^e_inv < 0. From Lemma A3 or Lemma A4, we obtain H(Φ_inv Z^e_inv + Φ_spu W^e_1 Y^e + Φ_spu W^e_0 (1 − Y^e)) > H(Φ_inv Z^e_inv). Therefore, there exists a better solution of IB-ERM with zero weight on Z^e_spu, which contradicts the assumption.
So far, we have proved that the feature extractor Φ* learned by IB-ERM never extracts spurious and invariant features together. We then perform the singular value decomposition (SVD) of Φ*:

Φ* = U Λ V^T = [U_1, U_2] diag(Λ_1, 0) [V_1, V_2]^T,

where Λ_1 contains the nonzero singular values. Let S ∈ R^{d×d} be the orthogonal matrix with X^e = S [Z^e_inv; Z^e_spu]. Set r = Rank(Φ*), and write V_1^T S = [B_1, B_2] with B_1 ∈ R^{r×m} and B_2 ∈ R^{r×o}, and V_2^T S = [C_1, C_2] with C_1 ∈ R^{(d−r)×m} and C_2 ∈ R^{(d−r)×o}. Then

Φ* X^e = U_1 Λ_1 V_1^T S [Z^e_inv; Z^e_spu] = U_1 Λ_1 (B_1 Z^e_inv + B_2 Z^e_spu).

Since Φ* X^e contains information either from the spurious features or from the invariant features, but not both, we must have U_1 Λ_1 B_1 = 0 or U_1 Λ_1 B_2 = 0, and thus B_1 = 0 or B_2 = 0, because Rank(U_1 Λ_1) = r. If B_2 = 0, then Φ* extracts invariant features only. Otherwise, when B_1 = 0, we decompose V^T S as

V^T S = [B_1, B_2; C_1, C_2].

Since V^T and S are both orthogonal, V^T S is also orthogonal; its rows are orthonormal, so B_1 = 0 implies B_2 C_2^T = 0, and then Rank(C_2) = Rank([B_2; C_2]) − Rank(B_2) = o − r (note that r ≤ min{m, o}). Therefore, by running CSIB for one iteration, the rank of the spurious features decreases by r > 0, so finitely many runs of CSIB drive the weight on the spurious features to zero.

Next, we show why the counterfactual supervision step can distinguish whether B_1 = 0. For a specific instance x = S [z_inv; z_spu], let z^1 and z^2 be two interventions on the first r coordinates of the rotated representation V^T x, with z^1_{1:r} = −z^2_{1:r}, and let x_1 and x_2 be the corresponding counterfactual inputs. If B_1 = 0 (only spurious features are extracted), then

S^{-1} x_1 = [z_inv; B_2^T z^1_{1:r} + C_2^T C_2 z_spu],   S^{-1} x_2 = [z_inv; B_2^T z^2_{1:r} + C_2^T C_2 z_spu],

so the invariant part is unchanged and the ground-truth labels of x_1 and x_2 are the same. On the other hand, if B_1 ≠ 0, then B_2 = 0 (only invariant features are extracted), and

S^{-1} x_1 = [B_1^T z^1_{1:r} + C_1^T C_1 z_inv; z_spu],   S^{-1} x_2 = [B_1^T z^2_{1:r} + C_1^T C_1 z_inv; z_spu].

Since z^1_{1:r} = −z^2_{1:r} and their magnitudes are large enough that sgn(w*_inv · (B_1^T z^1_{1:r} + C_1^T C_1 z_inv)) ≠ sgn(w*_inv · (B_1^T z^2_{1:r} + C_1^T C_1 z_inv)), the ground-truth labels of x_1 and x_2 differ. Therefore, the counterfactual supervision step can detect whether invariant or spurious features were extracted using only a single sample.
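The detection step can be mimicked numerically: intervene on the extracted coordinates with two opposite, large-magnitude values and check whether the invariant block of the recovered representation changes (a self-contained sketch of our own; the matrices below are random stand-ins for S and the learned extractor):

```python
import numpy as np

rng = np.random.default_rng(2)
m, o = 2, 3                                    # invariant / spurious dimensions
d = m + o
S, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal mixing matrix

z_inv = rng.normal(size=m)
z_spu = rng.normal(size=o)
x = S @ np.concatenate([z_inv, z_spu])

def counterfactual_pair(Phi, x, scale=10.0):
    """Set the extracted features to +scale*f and -scale*f, mapping the edit
    back to input space with the pseudo-inverse (minimum-norm intervention)."""
    f = Phi @ x
    Pinv = np.linalg.pinv(Phi)
    return x + Pinv @ (scale * f - f), x + Pinv @ (-scale * f - f)

# extractor that (through S) reads only the spurious block: labels must agree
Phi_spu = np.hstack([np.zeros((o, m)), np.eye(o)]) @ S.T
x1, x2 = counterfactual_pair(Phi_spu, x)
spu_keeps_invariant = (np.allclose((S.T @ x1)[:m], z_inv)
                       and np.allclose((S.T @ x2)[:m], z_inv))

# extractor that reads only the invariant block: the counterfactuals can flip the label
Phi_inv = np.hstack([np.eye(m), np.zeros((m, o))]) @ S.T
x1, x2 = counterfactual_pair(Phi_inv, x)
inv_moves_invariant = not np.allclose((S.T @ x1)[:m], (S.T @ x2)[:m])
```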
Finally, when only invariant features are extracted by Φ, the training error is minimized, i.e., w* Φ_inv ∈ arg min_f E_P[ℓ(f(Z^tr_inv), Y^tr)]. Then, based on our assumption on the OOD environments (Assumption 8), i.e., ∀e ∈ E_ood, F_ℓ(P(Z^tr_inv, Y^tr)) ⊆ F_ℓ(P(Z^e_inv, Y^e)), for any e ∈ E_ood we have E_P[ℓ((X^e, Y^e), w* Φ)] = E_P[ℓ((Z^e_inv, Y^e), w* Φ_inv)] = E_P[ℓ((Z^tr_inv, Y^tr), w* Φ_inv)] = q. It is worth noting that the proof of Theorem 3 does not depend on the number of labels, so it easily extends to the multi-class classification case as long as the corresponding assumptions and conditions are satisfied.