# Frustratingly Easy Environment Discovery for Invariant Learning

## Abstract

## 1. Introduction

- We present a novel environment discovery approach using the Generalized Cross-Entropy (GCE) loss function, ensuring the reference classifier leverages spurious correlations. Subsequently, we partition the dataset into two distinct environments based on the performance of the reference classifier and employ invariant learning algorithms to remove biases.
- We study the environments in invariant learning from the perspective of the “Environment Invariance Constraint” (EIC), which forms the foundation for FEED.
- We introduce the Square-MNIST dataset to evaluate the ability of our model in more challenging scenarios where the true causal features (strokes) and spurious features (squares) closely resemble each other. Our evaluation demonstrates the superior performance of FEED compared to other environment discovery approaches.

**Figure 1.** Training dynamics for the CMNIST benchmark. For bias-aligned samples, the label y can be easily predicted from the spurious association; for the other samples, this association does not hold. While the loss for bias-aligned samples decreases quickly, the loss for the other samples increases during the early epochs.

## 2. Related Works

**Bias Removal without Environment Labels.** Since obtaining environment or group annotations can be costly or infeasible, various methods have been proposed to remove biases by exploiting the mistakes of an ERM model (also known as the reference model). One line of work uses these mistakes to reweight the data for training the primary model [7,13,19,20,21,22]. For example, [7] up-weights the samples misclassified by the reference model, and [13] determines importance weights from the relative cross-entropy losses of the reference and primary models. These methods differ from ours because, instead of training a classifier with curated importance weights, we train an invariant predictor. Another line of work leverages the mistakes to apply an invariant learning algorithm [6,23,24]. Refs. [23,24] both train a GroupDRO model by inferring subclasses from the representations learned by the reference model. The most closely related work to ours is EIIL [6], which infers the environments for invariant learning by maximizing the regularization term of IRM. The main drawback of the above-mentioned methods is the assumption that the ERM model always learns the shortcut. This is the case in benchmarks like CMNIST, which are specifically created to frustrate ERM [25]. However, we show that these methods fail on simpler tasks that do not satisfy this assumption. Another group of works trains a separate network to find either sample weights or environment assignment probabilities. Ref. [26], for instance, extends DRO using an auxiliary model to compute the importance weights. However, rather than training an online fair model for accurate predictions within a given distribution, we aim to find data partitions that allow us to employ invariant learning techniques to address distribution shifts [6]. ZIN [27] also uses an auxiliary network to learn a partition function based on IRM, but this structure cannot be generalized to provide environments for other robust algorithms. Ref. [28] also proposes a framework to partition the data. However, their method is limited to the case where the input can be decomposed into invariant and variant features. Other works create domains for adversarial training [29], but we focus on invariant learning due to the limitations of adversarial methods.

**Invariant Learning.** Recent studies have addressed biases by learning invariances in training data. Motivated by causal discovery, IRM [3] and its variants [25,30,31,32,33] learn a representation such that the optimal classifier built on top of it is the same for all training environments. LISA [34] also learns invariant predictors via selective mix-up augmentation across different environments. Other methods like Fish [35], IGA [36], and Fishr [37] introduce gradient alignment constraints across training environments. Another large class of methods for generalizing beyond the training data is distributionally robust optimization (DRO) [5,38,39,40]. REx [17] and GroupDRO [5] are notable instances of DRO methods, aiming to find a solution that performs equally well across all environments. The success of the above-mentioned methods depends on environment partitions or group annotations. However, these annotations are often unavailable or expensive in practice. Beyond the methods discussed above, adversarial training is another popular approach for learning invariant or conditionally invariant representations [15,16,29,41,42]. However, the performance of adversarial training degrades in settings where distribution shift affects the marginal distribution of labels [3,42]. Due to these limitations, recent works have focused on learning invariant predictors.

## 3. Frustratingly Easy Environment Discovery

**Algorithm 1.** FEED algorithm.

**Input:** dataset $D={\left\{({x}_{i},{y}_{i})\right\}}_{i=1}^{N}$, model $M$

**Output:** environments ${e}_{1}$, ${e}_{2}$

1. Randomly initialize ${e}_{1}$ and ${e}_{2}$ using `np.random.randint`
2. **for** epochs **do**
3. train $M$ by minimizing ${\mathbb{E}}_{p((x,y)\mid {e}_{1})}\left[{l}_{GCE}(M(x),y)\right]$
4. **for** $({x}_{i},{y}_{i})\in D$ **do**
5. **if** ${l}_{CE}(M({x}_{i}),{y}_{i})$ **then**
6. Assign $({x}_{i},{y}_{i})$ to ${e}_{1}$
7. **else**
8. Assign $({x}_{i},{y}_{i})$ to ${e}_{2}$
9. **end if**
10. **end for**
11. **end for**
12. **return** ${e}_{1}$, ${e}_{2}$
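The two loss functions and the reassignment step of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the GCE loss uses the form $(1-p_y^q)/q$ from [44], while the default `q=0.7`, the helper names, the toy predictions, and in particular the rule that compares each sample's cross-entropy loss with a fixed `threshold` are assumptions made here. Training of the model is omitted; the model is represented only by its predicted class probabilities.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Per-sample cross-entropy loss, -log p_y."""
    p_y = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(p_y, 1e-12, 1.0))


def gce_loss(probs, labels, q=0.7):
    """Per-sample generalized cross-entropy, (1 - p_y**q) / q, for 0 < q <= 1."""
    p_y = probs[np.arange(len(labels)), labels]
    return (1.0 - p_y ** q) / q


def assign_environments(probs, labels, threshold):
    """Split sample indices by per-sample CE loss.

    ASSUMPTION: low-loss samples go to e1 and the rest to e2; the exact
    comparison used in Algorithm 1 is not reproduced here.
    """
    losses = cross_entropy(probs, labels)
    in_e1 = losses < threshold
    return np.flatnonzero(in_e1), np.flatnonzero(~in_e1)


# Hypothetical predictions for four samples of a binary task.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
labels = np.array([0, 1, 1, 0])

# For a binary task, a threshold of log(2) separates p_y > 0.5 from p_y < 0.5.
e1, e2 = assign_environments(probs, labels, threshold=np.log(2.0))
```

In this toy case the first two samples, which the model predicts correctly, end up in `e1`, and the two misclassified samples end up in `e2`.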

## 4. Experiments

#### 4.1. Dataset

#### 4.2. Implementation Details

#### 4.3. Results and Discussions

**Analysis of Discovered Environments.** We studied how samples from different groups were distributed across the environments created by FEED, as shown in Table 2. Note that FEED does not use these group annotations. We expected ${e}_{1}$ to contain the samples in which the shortcut is present. For instance, in CMNIST, we observed that ${e}_{1}$ contained only the samples where the label and color (the spurious attribute) agree. This is reasonable since, in this dataset, the digit color and target agree for 85% of the training images (on average). All other samples were assigned to ${e}_{2}$, where the shortcut is reversed and color and label disagree. Thus, $\mathbb{E}[y \mid \Psi, e]$ varies substantially; i.e., the correlation between color and target is unstable across environments. However, the correlation between the digit shape and the target remains invariant. Consequently, when we applied an invariant learning algorithm, the model could not satisfy the EIC unless it learned the digit shape. On the other hand, in the standard CMNIST training environments, there is still a slight chance of assuming an invariant association between color and target across the environments (about 10%). For Waterbirds and CelebA, we observe similar behavior. For instance, in Waterbirds, only 56 training images of waterbirds on land are available, of which 50 are assigned to ${e}_{2}$. We further analyzed the six images that were assigned to ${e}_{1}$ (shown in Figure 3). As can be seen, most of these samples have backgrounds resembling water; i.e., they are similar to waterbirds on water, which are mainly assigned to ${e}_{1}$ (861 vs. 195). This may explain why these six images are assigned to ${e}_{1}$. Note that waterbirds on water are mostly assigned to ${e}_{1}$ since it is intended to contain the samples with the prevalent shortcut.

**Group Sufficiency Gap.** Another way to explain the efficacy of our environments is to evaluate the group sufficiency gap $g=\left|\mathbb{E}[Y \mid \Psi(x),{e}_{1}]-\mathbb{E}[Y \mid \Psi(x),{e}_{2}]\right|$, defined based on the EIC [6]. This metric measures the degree to which the environment assignments can violate the EIC. We seek a partitioning strategy that maximizes g: a greater g means higher variation across environments, which can lead to a tighter invariant set. Suppose that, in each created environment, the classifier relies solely on the spurious attribute a to make predictions, i.e., $\Psi(x)=a$. Then the gap is $g=\left|\mathbb{E}[Y \mid a,{e}_{1}]-\mathbb{E}[Y \mid a,{e}_{2}]\right|$. For the FEED environments on CMNIST, in ${e}_{1}$ all digits 5 to 9 ($y=1$) are red ($a=1$) and all digits 0 to 4 ($y=0$) are green ($a=0$), while in ${e}_{2}$ all digits 5 to 9 ($y=1$) are green ($a=0$) and all digits 0 to 4 ($y=0$) are red ($a=1$). In this case, $g=1$, which is its maximum value. On the other hand, for the standard CMNIST environments, the gap is 0.1 [6], and for the EIIL environments $g=0.83$. The derivations are provided in Appendix C.
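As a rough illustration of this definition, the gap for a predictor that relies only on the spurious attribute, $\Psi(x)=a$, can be estimated directly from labels, attributes, and environment assignments. The function name and the choice to report the largest gap over attribute values are assumptions of this sketch, and the toy arrays only mimic the idealized FEED split of CMNIST described above; they are not the real data.

```python
import numpy as np


def group_sufficiency_gap(y, a, env):
    """Estimate |E[Y | a, e1] - E[Y | a, e2]|, maximized over values of a.

    y, a: binary arrays; env: array with entries 1 or 2.
    ASSUMPTION: the maximum over attribute values is reported.
    Returns NaN if no attribute value occurs in both environments.
    """
    gaps = []
    for value in np.unique(a):
        in_e1 = (a == value) & (env == 1)
        in_e2 = (a == value) & (env == 2)
        if in_e1.any() and in_e2.any():
            gaps.append(abs(y[in_e1].mean() - y[in_e2].mean()))
    return max(gaps, default=float("nan"))


# Toy data: e1 holds samples where label and attribute agree, e2 the rest.
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
a = np.array([1, 1, 0, 0, 0, 0, 1, 1])
env = np.where(y == a, 1, 2)

g = group_sufficiency_gap(y, a, env)
```

For this agreement-based split the estimate reaches the maximum value of 1, matching the CMNIST case discussed above.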

**Why are ERM-based models not sufficient?** EIIL [6] assumes that the reference model, which is trained using ERM, learns the shortcut in the training dataset. Furthermore, recent work like JTT [7] claims to achieve out-of-distribution generalization by discovering the errors of an ERM model and then upweighting them during the next steps of training. One may therefore ask whether techniques like JTT can be used to partition the dataset and create environments for invariant learning methods. Although this strategy often works on datasets that were constructed to showcase out-of-distribution problems, assuming that the reference ERM model always learns the easy shortcuts is unrealistic. To illustrate this claim, we constructed a variant of the CMNIST dataset in which the robust feature (digit shape) is more predictive than the spurious feature (digit color) by decreasing the label noise level to 10% [25]. Table 3 compares the performance of different methods on this new dataset, called INVERSE-CMNIST. While ERM failed on standard CMNIST (Table 1), it performed well on INVERSE-CMNIST because relying on the most predictive feature (digit shape) is a good strategy for this task [25]. Other methods fail to achieve good performance on INVERSE-CMNIST because they are based on the assumption that the ERM model learns the shortcut. EIIL also cannot create useful environments in this case. In contrast, FEED utilizes the GCE loss function to encourage the model to learn the shortcut and increase variation among environments. In this experiment, FEED assigned environments identical to those for standard CMNIST, shown in Table 2.
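To make the role of label noise concrete, the following sketch simulates only the labels and colors of a CMNIST-style dataset, without images. It assumes the usual construction in which the binary shape label is flipped with some probability to obtain the target, and the color is then set to disagree with the target with a second probability. The function name, the sample size, the single 15% color-flip rate (chosen to match the 85% average agreement mentioned above), and the 25% label noise used for the standard setting are illustrative assumptions.

```python
import numpy as np


def simulate_cmnist_attributes(n, label_noise, color_flip, seed=0):
    """Simulate shape label, noisy target y, and color a (no images)."""
    rng = np.random.default_rng(seed)
    shape = rng.integers(0, 2, size=n)          # 1 for digits 5-9, 0 for digits 0-4
    y = shape ^ (rng.random(n) < label_noise)   # flip the target with prob. label_noise
    a = y ^ (rng.random(n) < color_flip)        # color disagrees with y with prob. color_flip
    return shape, y, a


for name, noise in [("CMNIST", 0.25), ("INVERSE-CMNIST", 0.10)]:
    shape, y, a = simulate_cmnist_attributes(200_000, noise, color_flip=0.15)
    shape_acc = (shape == y).mean()   # how well digit shape predicts the target
    color_acc = (a == y).mean()       # how well color predicts the target
    print(f"{name}: shape agrees with y {shape_acc:.2f}, color agrees with y {color_acc:.2f}")
```

Under these assumptions, color is the more predictive feature at 25% label noise, so an ERM reference model tends to pick it up. At 10% label noise the shape is more predictive, which is the regime where the assumption that ERM learns the shortcut breaks down.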

**What if we have more challenging shortcuts?** In the introduced SMNIST dataset, although the square can serve as a shortcut, it is not as straightforward to exploit as the color in CMNIST or the background in Waterbirds. This is because, in the feature space, the square is at a similar level to the digit strokes, making it harder to distinguish from the digits themselves. The results, shown in Table 4, indicate that the ERM model learns a mix of the shortcut and the main task. Therefore, as with INVERSE-CMNIST, ERM-based models like EIIL cannot perform well on this task. In contrast, FEED can effectively create useful environments, although this is a challenging scenario for FEED as well. We started updating the environments after five epochs of training with the initial random assignments in order to give the model enough time to learn the challenging shortcut before updating the partitioning. Over 10 repetitions of the partitioning experiment, the group sufficiency gap for the environments created by FEED was $g=0.98$ on average, while for EIIL it was $g=0.74$. Additionally, our created environments improve the performance of the invariant learning algorithms. This challenging dataset also sheds light on the effect of GCE in FEED. We repeated this experiment after replacing GCE with the standard cross-entropy (CE) loss, as shown in Table 4. In this case, CE FEED was unable to identify the shortcut and partitioned the dataset based on the target. Therefore, invariant algorithms cannot learn a tight invariant set ($g=0.5$).
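One way to see why GCE, rather than CE, pushes the reference model toward the shortcut is to compare gradients. Assume the GCE form ${l}_{GCE}=(1-p_y^{q})/q$ with $q\in(0,1]$ from [44], where $p_y$ is the predicted probability of the true class and $\theta$ denotes the model parameters (notation introduced here for illustration). With ${l}_{CE}=-\log p_y$, a short calculation gives:

$$
\nabla_\theta\, {l}_{GCE} \;=\; -\,p_y^{\,q-1}\,\nabla_\theta\, p_y \;=\; p_y^{\,q}\left(-\frac{\nabla_\theta\, p_y}{p_y}\right) \;=\; p_y^{\,q}\,\nabla_\theta\, {l}_{CE}.
$$

Each sample's CE gradient is therefore scaled by $p_y^{q}$: samples that the model already predicts confidently, which are typically the ones where the shortcut holds, receive a larger weight, while hard samples are down-weighted. As $q\to 0$ the weight tends to 1 and GCE reduces to CE, which removes this amplification.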

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Loss Dynamics for CMNIST

**Figure A1.** (**a**) Training dynamics for the standard CMNIST benchmark. For bias-aligned samples, the label y can be (easily) predicted based on the spurious associations that are prevalent in the training dataset. For other samples, this spurious correlation is reversed. While the loss for bias-aligned samples decreases quickly, for other samples the loss goes up at early epochs. (**b**) Training dynamics for predicting the color in the CMNIST dataset. We used color as the training target, and digit shapes are considered spurious attributes. Therefore, the original task is easier to learn and the loss dynamics for all samples are similar (we used batch training).

## Appendix B. Using Accuracy as Difficulty Score

## Appendix C. Group Sufficiency Gap

**CMNIST Standard Environments:** For the standard environment assignment in the CMNIST benchmark, in environment ${e}_{1}$, 90% of digits 5 to 9 ($y=1$) are red ($a=1$) and 90% of digits 0 to 4 ($y=0$) are green ($a=0$), while in environment ${e}_{2}$, this correlation is 80%. Table A1 shows this distribution more formally. In this case, we can compute the gap as follows:

**Table A1.** Distribution of each class in the created environments. $a=0$ and $a=1$ correspond to green and red, respectively. The numbers show the composition of samples for each class within the environments. Note that this table is different from Table 2.

| Group | Standard CMNIST ${e}_{1}$ | Standard CMNIST ${e}_{2}$ | CMNIST FEED ${e}_{1}$ | CMNIST FEED ${e}_{2}$ | CMNIST EIIL ${e}_{1}$ | CMNIST EIIL ${e}_{2}$ | SMNIST FEED ${e}_{1}$ | SMNIST FEED ${e}_{2}$ | SMNIST EIIL ${e}_{1}$ | SMNIST EIIL ${e}_{2}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $(a=0,y=0)$ | 90.0 | 80.0 | 100.0 | 0.0 | 93.0 | 7.0 | 99.93 | 4.95 | 85.16 | 9.68 |
| $(a=1,y=0)$ | 10.0 | 20.0 | 0.0 | 100.0 | 7.0 | 93.0 | 0.07 | 95.05 | 14.84 | 90.32 |
| $(a=0,y=1)$ | 10.0 | 20.0 | 0.0 | 100.0 | 6.0 | 89.0 | 0.31 | 98.64 | 15.15 | 89.63 |
| $(a=1,y=1)$ | 90.0 | 80.0 | 100.0 | 0.0 | 94.0 | 11.0 | 99.69 | 1.36 | 84.85 | 10.37 |

**FEED for CMNIST:** As shown in Table 2 and Table A1, FEED can split the CMNIST dataset based on the spurious attribute. It discovers an environment assignment based on the agreement between the label y and the spurious attribute a. Therefore, the group sufficiency gap would be:

**EIIL for CMNIST:** The distribution of each class in the environments that EIIL [6] creates is shown in Table A1. We can find the group sufficiency gap as follows:

**FEED for SquareMNIST:** We repeated this experiment 10 times and then took the average of the environment assignments to make sure our results are stable and reproducible. The distribution for each environment is given in Table A1. In this case, the group sufficiency gap would be:

**EIIL for SquareMNIST:** According to Table A1, the group sufficiency gap can be computed as follows:
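The gaps quoted in the main text can be checked against Table A1 with a few lines of Python. This sketch assumes that, with $\Psi(x)=a$ and $a=1$, the $(a=1,y=1)$ row of Table A1 is read as the estimate of $\mathbb{E}[Y \mid a, e]$ in each environment. That reading and the dictionary layout are assumptions made here, not the authors' stated derivation.

```python
# (a=1, y=1) row of Table A1, in percent, as (e1, e2) for each partition.
table_a1_row = {
    "Standard CMNIST": (90.0, 80.0),
    "CMNIST FEED": (100.0, 0.0),
    "CMNIST EIIL": (94.0, 11.0),
    "SMNIST FEED": (99.69, 1.36),
    "SMNIST EIIL": (84.85, 10.37),
}

# Gap per partition: absolute difference between the two environments, as a fraction.
gaps = {name: abs(e1 - e2) / 100.0 for name, (e1, e2) in table_a1_row.items()}

for name, g in gaps.items():
    print(f"{name}: g = {g:.2f}")
```

Under this reading, the values round to 0.10, 1.00, 0.83, 0.98, and 0.74, which agree with the gaps reported in Section 4.3.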

## References

1. Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. **2020**, 2, 665–673.
2. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) **2021**, 54, 1–35.
3. Arjovsky, M.; Bottou, L.; Gulrajani, I.; Lopez-Paz, D. Invariant risk minimization. arXiv **2019**, arXiv:1907.02893.
4. Arpit, D.; Jastrzębski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M.S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; et al. A closer look at memorization in deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 233–242.
5. Sagawa, S.; Koh, P.W.; Hashimoto, T.B.; Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv **2019**, arXiv:1911.08731.
6. Creager, E.; Jacobsen, J.H.; Zemel, R. Environment inference for invariant learning. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 2189–2200.
7. Liu, E.Z.; Haghgoo, B.; Chen, A.S.; Raghunathan, A.; Koh, P.W.; Sagawa, S.; Liang, P.; Finn, C. Just train twice: Improving group robustness without training group information. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 6781–6792.
8. Howard, F.M.; Dolezal, J.; Kochanny, S.; Schulte, J.; Chen, H.; Heij, L.; Huo, D.; Nanda, R.; Olopade, O.I.; Kather, J.N.; et al. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nat. Commun. **2021**, 12, 1–13.
9. Larrazabal, A.J.; Nieto, N.; Peterson, V.; Milone, D.H.; Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl. Acad. Sci. USA **2020**, 117, 12592–12594.
10. Oakden-Rayner, L.; Dunnmon, J.; Carneiro, G.; Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, Toronto, ON, Canada, 2–4 April 2020; pp. 151–159.
11. Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the Conference on Fairness, Accountability and Transparency, PMLR, New York, NY, USA, 23–24 February 2018; pp. 77–91.
12. Krco, N.; Laugel, T.; Loubes, J.M.; Detyniecki, M. When Mitigating Bias is Unfair: A Comprehensive Study on the Impact of Bias Mitigation Algorithms. arXiv **2023**, arXiv:2302.07185.
13. Nam, J.; Cha, H.; Ahn, S.; Lee, J.; Shin, J. Learning from failure: De-biasing classifier from biased classifier. Adv. Neural Inf. Process. Syst. **2020**, 33, 20673–20684.
14. Krasanakis, E.; Spyromitros-Xioufis, E.; Papadopoulos, S.; Kompatsiaris, Y. Adaptive sensitive reweighting to mitigate bias in fairness-aware classification. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 853–862.
15. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176.
16. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. Adv. Neural Inf. Process. Syst. **2018**, 31, 1647–1657.
17. Krueger, D.; Caballero, E.; Jacobsen, J.H.; Zhang, A.; Binas, J.; Zhang, D.; Le Priol, R.; Courville, A. Out-of-distribution generalization via risk extrapolation (rex). In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 5815–5826.
18. Srivastava, M.; Hashimoto, T.; Liang, P. Robustness to spurious correlations via human annotations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 9109–9119.
19. Dagaev, N.; Roads, B.D.; Luo, X.; Barry, D.N.; Patil, K.R.; Love, B.C. A too-good-to-be-true prior to reduce shortcut reliance. arXiv **2021**, arXiv:2102.06406.
20. Rosenfeld, E.; Ravikumar, P.; Risteski, A. Domain-adjusted regression or: Erm may already learn features sufficient for out-of-distribution generalization. arXiv **2022**, arXiv:2202.06856.
21. Idrissi, B.Y.; Arjovsky, M.; Pezeshki, M.; Lopez-Paz, D. Simple data balancing achieves competitive worst-group-accuracy. In Proceedings of the Conference on Causal Learning and Reasoning, PMLR, Eureka, CA, USA, 11–13 April 2022; pp. 336–351.
22. Kirichenko, P.; Izmailov, P.; Wilson, A.G. Last layer re-training is sufficient for robustness to spurious correlations. arXiv **2022**, arXiv:2204.02937.
23. Bao, Y.; Chang, S.; Barzilay, R. Predict then interpolate: A simple algorithm to learn stable classifiers. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 640–650.
24. Sohoni, N.; Dunnmon, J.; Angus, G.; Gu, A.; Ré, C. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Adv. Neural Inf. Process. Syst. **2020**, 33, 19339–19352.
25. Zhang, J.; Lopez-Paz, D.; Bottou, L. Rich feature construction for the optimization-generalization dilemma. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 26397–26411.
26. Lahoti, P.; Beutel, A.; Chen, J.; Lee, K.; Prost, F.; Thain, N.; Wang, X.; Chi, E. Fairness without demographics through adversarially reweighted learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 728–740.
27. Yong, L.; Zhu, S.; Tan, L.; Cui, P. ZIN: When and How to Learn Invariance Without Environment Partition? Adv. Neural Inf. Process. Syst. **2022**, 35, 24529–24542.
28. Liu, J.; Hu, Z.; Cui, P.; Li, B.; Shen, Z. Heterogeneous risk minimization. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 6804–6814.
29. Matsuura, T.; Harada, T. Domain generalization using a mixture of multiple latent domains. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11749–11756.
30. Bae, J.H.; Choi, I.; Lee, M. Meta-learned invariant risk minimization. arXiv **2021**, arXiv:2103.12947.
31. Lin, Y.; Dong, H.; Wang, H.; Zhang, T. Bayesian invariant risk minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16021–16030.
32. Wald, Y.; Feder, A.; Greenfeld, D.; Shalit, U. On calibration and out-of-domain generalization. Adv. Neural Inf. Process. Syst. **2021**, 34, 2215–2227.
33. Lin, Y.; Zhu, S.; Cui, P. ZIN: When and How to Learn Invariance by Environment Inference? arXiv **2022**, arXiv:2203.05818.
34. Yao, H.; Wang, Y.; Li, S.; Zhang, L.; Liang, W.; Zou, J.; Finn, C. Improving out-of-distribution robustness via selective augmentation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 25407–25437.
35. Shi, Y.; Seely, J.; Torr, P.H.; Siddharth, N.; Hannun, A.; Usunier, N.; Synnaeve, G. Gradient matching for domain generalization. arXiv **2021**, arXiv:2104.09937.
36. Koyama, M.; Yamaguchi, S. Out-of-distribution generalization with maximal invariant predictor. arXiv **2020**, arXiv:2008.01883.
37. Rame, A.; Dancette, C.; Cord, M. Fishr: Invariant gradient variances for out-of-distribution generalization. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 18347–18377.
38. Sagawa, S.; Raghunathan, A.; Koh, P.W.; Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 8346–8356.
39. Zhang, J.; Menon, A.; Veit, A.; Bhojanapalli, S.; Kumar, S.; Sra, S. Coping with label shift via distributionally robust optimisation. arXiv **2020**, arXiv:2010.12230.
40. Ben-Tal, A.; El Ghaoui, L.; Nemirovski, A. Robust Optimization; Princeton University Press: Princeton, NJ, USA, 2009; Volume 28.
41. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1180–1189.
42. Zhao, H.; Des Combes, R.T.; Zhang, K.; Gordon, G. On learning invariant representations for domain adaptation. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7523–7532.
43. Pezeshki, M.; Kaba, O.; Bengio, Y.; Courville, A.C.; Precup, D.; Lajoie, G. Gradient starvation: A learning proclivity in neural networks. Adv. Neural Inf. Process. Syst. **2021**, 34, 1256–1272.
44. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. **2018**, 31, 8792–8802.
45. Rockafellar, R.T.; Uryasev, S. Optimization of Conditional Value-at-Risk. J. Risk **2000**, 2, 21–42.
46. Sun, B.; Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part III 14. pp. 443–450.

**Figure 2.** Sample training and test images for our datasets. The spurious correlation between the label y and the attribute a changes at test time, making the tasks challenging.

**Figure 3.** Waterbirds on land images that are assigned to ${e}_{1}$. In most of them, the background is or resembles water.

**Figure 4.** Test accuracy for CMNIST with varying levels of label noise. While EIIL can only perform well under high noise, FEED consistently performs well. After both EIIL and FEED, IRM is used as the invariant learning algorithm.

**Table 1.** Test accuracy. Compared to EIIL, the environments created by FEED substantially improve worst-group accuracy. GroupDRO sets an upper bound since it assumes access to group annotations. Since environment labels for the Waterbirds and CelebA datasets are unavailable, IRM and REx are not applicable. On the CMNIST dataset, although the training environments are available, our created environments improved the performance. Experiments were repeated five times.

| Method | CMNIST Avg. Acc. | CMNIST Worst-Group Acc. | Waterbirds Avg. Acc. | Waterbirds Worst-Group Acc. | CelebA Avg. Acc. | CelebA Worst-Group Acc. |
|---|---|---|---|---|---|---|
| ERM | 17.1 ± 0.4% | 8.9 ± 1.8% | 97.3 ± 0.2% | 60.3 ± 1.9% | 95.6 ± 0.2% | 47.2 ± 3.7% |
| LfF | 42.7 ± 0.5% | 33.2 ± 2.2% | 91.2 ± 0.7% | 78.0 ± 2.3% | 85.1 ± 0.4% | 72.2 ± 1.4% |
| JTT | 16.3 ± 0.8% | 12.5 ± 2.4% | 93.3 ± 0.7% | 86.7 ± 1.2% | 88.0 ± 0.3% | 81.1 ± 1.7% |
| GEORGE | 12.8 ± 2.0% | 9.2 ± 3.6% | 95.7 ± 0.5% | 76.2 ± 2.0% | 94.6 ± 0.2% | 54.9 ± 1.9% |
| CVar DRO | 33.2 ± 0.5% | 27.9 ± 1.1% | 96.0 ± 1.0% | 75.9 ± 2.2% | 82.5 ± 0.6% | 64.4 ± 2.9% |
| Fish | 46.9 ± 0.9% | 35.6 ± 1.5% | 85.6 ± 0.8% | 64.0 ± 1.7% | 93.1 ± 0.4% | 61.2 ± 1.8% |
| SD | 68.4 ± 1.1% | 62.3 ± 1.4% | 76.8 ± 1.3% | 71.8 ± 1.8% | 91.6 ± 1.3% | 83.2 ± 2.0% |
| CORAL | 65.1 ± 2.5% | 60.2 ± 4.1% | 90.3 ± 1.1% | 79.8 ± 2.5% | 93.8 ± 0.9% | 76.9 ± 3.6% |
| IRM | 66.9 ± 1.1% | 58.1 ± 3.7% | – | – | – | – |
| vREx | 68.7 ± 0.7% | 63.8 ± 2.8% | – | – | – | – |
| EIIL+IRM | 68.4 ± 0.8% | 16.7 ± 14.2% | 90.3 ± 0.2% | 63.1 ± 1.0% | 72.5 ± 0.1% | 54.0 ± 0.8% |
| EIIL+vREx | 57.4 ± 0.8% | 14.7 ± 11.8% | 89.7 ± 0.8% | 65.2 ± 3.4% | 76.4 ± 0.7% | 54.9 ± 2.6% |
| EIIL+GroupDRO | 44.4 ± 1.0% | 35.2 ± 8.2% | 96.9 ± 0.8% | 78.7 ± 1.0% | 90.7 ± 0.5% | 71.3 ± 0.9% |
| FEED+IRM (ours) | 70.4 ± 0.02% | 69.7 ± 0.8% | 92.3 ± 0.2% | 88.4 ± 0.9% | 86.0 ± 0.5% | 81.3 ± 1.4% |
| FEED+vREx (ours) | 71.1 ± 0.08% | 69.1 ± 1.2% | 93.3 ± 0.3% | 88.6 ± 1.0% | 86.9 ± 0.8% | 83.7 ± 1.4% |
| FEED+GroupDRO (ours) | 71.4 ± 0.02% | 71.0 ± 0.05% | 90.0 ± 0.3% | 88.0 ± 1.2% | 87.3 ± 0.6% | 84.3 ± 2.0% |
| GroupDRO | 71.4 ± 0.02% | 71.0 ± 0.05% | 93.5 ± 0.3% | 91.4 ± 1.1% | 92.9 ± 0.2% | 88.9 ± 2.3% |

**Table 2.** Distribution of each group in the created environments. $a=0$ and $a=1$ correspond to green and red for CMNIST, land and water background for Waterbirds, and female and male for CelebA, respectively. The numbers show how each group is distributed across the environments.

| Group | CMNIST ${e}_{1}$ | CMNIST ${e}_{2}$ | Waterbirds ${e}_{1}$ | Waterbirds ${e}_{2}$ | CelebA ${e}_{1}$ | CelebA ${e}_{2}$ |
|---|---|---|---|---|---|---|
| $(a=0,y=0)$ | 100.0 | 0.0 | 93.8 | 7.2 | 95.5 | 4.5 |
| $(a=1,y=0)$ | 0.0 | 100.0 | 16.9 | 83.1 | 99.5 | 0.5 |
| $(a=0,y=1)$ | 0.0 | 100.0 | 10.7 | 89.3 | 82.8 | 17.2 |
| $(a=1,y=1)$ | 100.0 | 0.0 | 81.5 | 18.5 | 30.5 | 69.5 |

**Table 3.** Test accuracy for INVERSE-CMNIST. Although a shortcut exists in the dataset, ERM performs well. Hence, methods that rely on an ERM reference model learning the shortcut fail to generalize. FEED can create an effective environment partitioning that helps invariant learning algorithms.

| Method | Avg. Acc. | Worst-Group Acc. |
|---|---|---|
| ERM | 72.1% | 68.1% |
| LfF | 34.5% | 13.3% |
| JTT | 26.1% | 18.1% |
| CVar DRO | 38.4% | 35.1% |
| SD | 79.1% | 74.9% |
| IRM | 78.3% | 75.3% |
| vREx | 83.5% | 81.7% |
| EIIL+IRM | 42.8% | 12.6% |
| EIIL+vREx | 42.5% | 6.0% |
| EIIL+GroupDRO | 12.9% | 2.5% |
| FEED+IRM (ours) | 85.6% | 84.7% |
| FEED+vREx (ours) | 85.9% | 85.2% |
| FEED+GroupDRO (ours) | 86.1% | 85.4% |
| GroupDRO | 86.1% | 85.4% |

**Table 4.** Test accuracy for SquareMNIST. Environments created by FEED enhance the accuracy of the invariant learning algorithms. Also, using GCE in FEED helps it find the shortcut in order to effectively partition the dataset.

| Method | Avg. Acc. | Worst-Group Acc. |
|---|---|---|
| ERM | 37.3% | 4.7% |
| LfF | 46.1% | 43.3% |
| JTT | 47.6% | 23.3% |
| CVar DRO | 49.0% | 40.1% |
| IRM | 41.6% | 5.8% |
| vREx | 54.7% | 12.2% |
| EIIL+IRM | 57.0% | 32.0% |
| EIIL+vREx | 58.9% | 38.2% |
| EIIL+GroupDRO | 67.6% | 57.0% |
| CE FEED+IRM | 36.8% | 3.7% |
| CE FEED+vREx | 33.7% | 3.2% |
| CE FEED+GroupDRO | 35.6% | 8.8% |
| FEED+IRM (ours) | 69.8% | 65.0% |
| FEED+vREx (ours) | 69.2% | 63.4% |
| FEED+GroupDRO (ours) | 71.3% | 65.0% |
| GroupDRO | 70.5% | 67.8% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zare, S.; Nguyen, H.V.
Frustratingly Easy Environment Discovery for Invariant Learning. *Comput. Sci. Math. Forum* **2024**, *9*, 2.
https://doi.org/10.3390/cmsf2024009002
