No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Abstract
1. Introduction
- We conduct a comprehensive study of four post hoc debiasing techniques (Logit Steering, Activation Patching, BiasEdit and Prompt Debiasing) across ten language models, creating a robust and generalizable body of evidence.
- We find consistent and statistically significant evidence for our “No Free Lunch” hypothesis. Targeted debiasing frequently causes significant collateral damage or “spillover” into untargeted dimensions, in some cases causing more harm than the original bias the intervention sought to fix.
- We present our methodology as a necessary framework (cf. Figure 1) for the responsible evaluation of bias mitigation techniques, advocating for the adoption of multi-dimensional analysis as a standard practice in the field.
2. Related Work
2.1. Trade-Offs in Algorithmic Success and AI
2.2. Cross-Dimensional Effects and Fairness Trade-Offs
2.3. Bias Benchmarks
2.4. Existing Mitigation Techniques
3. Methodology
3.1. Dataset: StereoSet
3.1.1. Scope and Dimensions
3.1.2. Structure of an Intersentence StereoSet Entry
- Context: “My neighbor is Hispanic.”
- Stereotype: “He doesn’t speak English.”
- Anti-stereotype: “He went to college and is a teacher.”
- Unrelated: “Dogs have funny tails.”
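Concretely, one such entry can be represented as a simple record. The field names below are illustrative and simplified relative to the released dataset’s schema, which stores the three continuations in a labeled list:

```python
# Illustrative representation of one intersentence StereoSet entry.
# Field names are simplified for exposition; they are not the dataset's exact schema.
entry = {
    "bias_type": "race",  # target dimension of this entry
    "context": "My neighbor is Hispanic.",
    "candidates": {
        "stereotype": "He doesn't speak English.",
        "anti-stereotype": "He went to college and is a teacher.",
        "unrelated": "Dogs have funny tails.",
    },
}
```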
3.1.3. Evaluation Metrics
- An ideal model with perfect coherence (LMS = 100) and no bias (SS = 50) achieves an ICAT score of 100.
- A fully biased model (SS = 0 or SS = 100) achieves an ICAT score of 0, regardless of its LMS.
- A random-guess model (LMS = 50, SS = 50) achieves an ICAT score of 50.
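These three properties follow directly from the ICAT definition of Nadeem et al., ICAT = LMS · min(SS, 100 − SS) / 50. A minimal sketch of the computation:

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score from the Language Modeling Score (LMS, 0-100)
    and Stereotype Score (SS, 0-100), following Nadeem et al. (2021)."""
    return lms * min(ss, 100.0 - ss) / 50.0

# The boundary cases listed above:
assert icat(100, 50) == 100   # perfect coherence, no bias
assert icat(90, 100) == 0     # fully biased model, regardless of LMS
assert icat(50, 50) == 50     # random-guess model
```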
3.2. Models
3.3. Bias Mitigation Techniques
3.3.1. Bias Direction Computation via PCA
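As a rough sketch of this step (the exact sentence representations, layer, and preprocessing in our pipeline may differ from this simplified version), the bias direction can be taken as the first principal component of the differences between stereotype and anti-stereotype sentence representations:

```python
import numpy as np

def bias_direction_pca(stereo_embs: np.ndarray, anti_embs: np.ndarray) -> np.ndarray:
    """First principal component of stereotype-minus-anti-stereotype
    representation differences, a common way to estimate a bias axis.

    stereo_embs, anti_embs: arrays of shape (n_pairs, hidden_dim).
    """
    diffs = stereo_embs - anti_embs
    diffs = diffs - diffs.mean(axis=0, keepdims=True)   # center the differences
    # Top right-singular vector of the centered matrix = first principal component.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    return direction / np.linalg.norm(direction)
```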
3.3.2. Geometric Interventions
Logit Steering (Projection-Based Debiasing)
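A minimal sketch of the projection step, assuming a unit-norm bias direction obtained as above; the layer at which the projection is applied and the intervention strength are configurable choices rather than fixed by the method:

```python
import torch

def project_out(hidden: torch.Tensor, bias_dir: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Remove the component of `hidden` along a unit-norm bias direction.

    hidden:   (..., hidden_dim) activations feeding the LM head.
    bias_dir: (hidden_dim,) unit-norm direction from the PCA step above.
    alpha:    intervention strength (1.0 removes the full component).
    """
    coeff = hidden @ bias_dir                              # (...,) projection coefficients
    return hidden - alpha * coeff.unsqueeze(-1) * bias_dir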
Activation Patching
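A minimal sketch of applying the same kind of edit during the forward pass via a PyTorch forward hook; the attribute path `model.model.layers` assumes a Hugging Face-style decoder (e.g., LLaMA), and the choice of layer is illustrative:

```python
import torch

def add_patching_hook(model, layer_idx: int, bias_dir: torch.Tensor, alpha: float = 1.0):
    """Register a forward hook that patches the output of one decoder layer
    by removing its component along a unit-norm bias direction.
    Assumes a Hugging Face-style decoder exposing `model.model.layers`."""
    layer = model.model.layers[layer_idx]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        direction = bias_dir.to(hidden.dtype).to(hidden.device)
        coeff = hidden @ direction                        # (batch, seq) projection coefficients
        patched = hidden - alpha * coeff.unsqueeze(-1) * direction
        if isinstance(output, tuple):
            return (patched,) + output[1:]
        return patched

    return layer.register_forward_hook(hook)              # keep the handle; .remove() undoes the patch
```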
3.3.3. Prompt Debiasing (Input-Based Intervention)
“Please respond without gender stereotypes or bias. Treat all genders equally. Examples of unbiased statement: -An engineer could be any gender……”
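A minimal sketch of this input-based intervention, with the instruction text abridged from the prompt quoted above; analogous instructions can be constructed for other bias dimensions:

```python
# Illustrative only: abridged version of the gender-debiasing prompt quoted above.
DEBIAS_INSTRUCTION = (
    "Please respond without gender stereotypes or bias. "
    "Treat all genders equally.\n"
)

def debias_input(context: str) -> str:
    """Prepend the debiasing instruction to the evaluation context."""
    return DEBIAS_INSTRUCTION + context
```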
3.3.4. BiasEdit (Parameter Editing)
4. Auditing Framework
4.1. Stage 1: Baseline Performance Calculation
4.2. Stage 2: Intervention Application and Evaluation
4.3. Stage 3: Multi-Dimensional Evaluation
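The three stages can be summarized as the following loop; `evaluate_stereoset` and `apply_intervention` are placeholders for the StereoSet scoring and mitigation routines described in Section 3, and the dimension list reflects StereoSet’s four intersentence categories:

```python
DIMENSIONS = ["gender", "profession", "race", "religion"]

def audit(model, techniques, evaluate_stereoset, apply_intervention):
    """Three-stage audit: baseline scores, targeted intervention,
    multi-dimensional re-evaluation. `evaluate_stereoset` is assumed to
    return {dimension: {"lms": ..., "ss": ..., "icat": ...}}."""
    baseline = evaluate_stereoset(model, DIMENSIONS)                  # Stage 1
    results = {}
    for technique in techniques:
        for target in DIMENSIONS:
            debiased = apply_intervention(model, technique, target)   # Stage 2
            scores = evaluate_stereoset(debiased, DIMENSIONS)         # Stage 3
            results[(technique, target)] = {
                dim: scores[dim]["icat"] - baseline[dim]["icat"]      # off-target deltas = spillover
                for dim in DIMENSIONS
            }
    return results
```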
5. Results
5.1. Systemic Trade-Offs in Bias Mitigation
5.2. Dimension-Specific Debiasing Success
5.3. Analysis of Bias Mitigation by Technique
5.4. Model Analysis
5.5. Statistical Significance Testing
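For illustration only (the specific test reported here may differ), a paired non-parametric test over per-example bias scores before and after an intervention is one way to assess whether an off-target shift is statistically significant:

```python
from scipy.stats import wilcoxon

def spillover_is_significant(before: list[float], after: list[float], alpha: float = 0.05) -> bool:
    """Wilcoxon signed-rank test on paired per-example bias scores for an
    untargeted dimension, measured before and after a targeted intervention.
    Purely illustrative of this class of paired test."""
    _, p_value = wilcoxon(before, after)
    return p_value < alpha
```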
6. Discussion
7. Conclusions
7.1. Key Empirical Findings
- Targeted interventions improved on-target ICAT scores in only 20.6% of cases, while causing statistically significant spillover harm in 31.5% of evaluations, a rate more than 1.5 times the on-target success rate.
- Many dimensions were susceptible to spillover: debiasing produced untargeted reductions in SS as large as 13.35% and increases in SS as large as 23.11%.
- Smaller models (≤2B parameters) experienced larger coherence losses (LMS drops) than larger models, highlighting how capacity constraints can exacerbate fairness–accuracy trade-offs.
- BiasEdit, while most effective at reducing on-target SS (72.5% of runs), exhibited the highest variance across dimensions and models, showing that parameter-editing methods are sensitive to architectural and distributional factors.
7.2. Methodological Contributions
- Our multi-dimensional audit framework extends evaluation beyond single-target metrics, systematically quantifying cross-dimensional spillovers that would otherwise remain hidden and that prior work has not analyzed.
- The framework can be adopted or extended during the development of debiasing techniques to better assess effectiveness before deployment while integrating both fairness and coherence metrics.
7.3. Scope Limitations
- Benchmark specificity: Results are based solely on StereoSet, reflecting Western cultural norms, dimensional imbalances, and known issues with stereotype validity.
- Techniques tested: Our experiments cover four techniques, ranging from model editing to post hoc interventions; additional approaches, including full fine-tuning, should also be investigated.
- Model coverage: Ten models from seven families (1–7B parameters) were tested; behavior in larger models (>70B), different architectures, or non-English models remains unexplored.
- Bias operationalization: Metrics capture distributional biases in sentence completion but do not cover allocation harm, representational harm, or downstream task disparities.
7.4. Practical and Research Implications
- Evaluate debiasing interventions across all relevant bias dimensions, not just the target.
- Monitor linguistic coherence alongside fairness metrics to avoid overcorrection and catastrophic forgetting, and to observe the overall systemic effects of debiasing beyond simply removing bias.
- Recognize that techniques effective on larger models may cause disproportionate harm in smaller models.
- Multi-dimensional mitigation: Develop methods that account for correlations between dimensions, e.g., joint optimization across fairness objectives, sequential debiasing ordered by dimensional independence, or constrained editing that preserves non-target representations.
- Mechanistic understanding: Identify which transformer components encode cross-dimensional associations and develop interventions targeting only responsible components, building on architectural bias tracing and interpretability tools.
- Benchmark improvement: Construct evaluation benchmarks with balanced representation, cultural diversity, and intersectional examples to robustly test mitigation strategies.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Weidinger, L.; Uesato, J.; Rauh, M.; Griffin, C.; Huang, P.S.; Mellor, J.; Glaese, A.; Cheng, M.; Balle, B.; Kasirzadeh, A.; et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 214–229.
- Gallegos, I.O.; Rossi, R.A.; Barrow, J.; Tanjim, M.M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; Ahmed, N.K. Bias and fairness in large language models: A survey. Comput. Linguist. 2024, 50, 1097–1179.
- Blodgett, S.L.; Barocas, S.; Daumé, H., III; Wallach, H. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5454–5476.
- Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and social risks of harm from language models. arXiv 2021, arXiv:2112.04359.
- Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. First Monday 2023, 28, 13346.
- Bolukbasi, T.; Chang, K.W.; Zou, J.Y.; Saligrama, V.; Kalai, A.T. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Proceedings of the Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 4349–4357.
- Bordia, S.; Bowman, S.R. Identifying and Reducing Gender Bias in Word-Level Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Minneapolis, MN, USA, 3–5 June 2019; pp. 7–15.
- Lauscher, A.; Lueken, T.; Glavaš, G. Sustainable Modular Debiasing of Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 4782–4797.
- Lin, Z.; Guan, S.; Zhang, W.; Zhang, H.; Li, Y.; Zhang, H. Towards trustworthy LLMs: A review on debiasing and dehallucinating in large language models. Artif. Intell. Rev. 2024, 57, 243.
- Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, Atlanta, GA, USA, 29–31 January 2019; pp. 59–68.
- Ferrara, E. The Butterfly Effect in artificial intelligence systems: Implications for AI bias and fairness. Mach. Learn. Appl. 2024, 15, 100525.
- Wolpert, D.; Macready, W. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
- Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017; pp. 43:1–43:23.
- Kearns, M.; Neel, S.; Roth, A.; Wu, Z.S. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv 2018, arXiv:1711.05144.
- Crenshaw, K. Mapping the Margins: Intersectionality, Identity Politics, and Violence against Women of Color. Stanf. Law Rev. 1991, 43, 1241–1299.
- Guo, W.; Caliskan, A. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, Virtual, 19–21 May 2021; pp. 122–133.
- Ma, W.; Chiang, B.; Wu, T.; Wang, L.; Vosoughi, S. Intersectional Stereotypes in Large Language Models: Dataset and Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Singapore, 2023; pp. 8589–8597.
- Souani, B.; Soremekun, E.; Papadakis, M.; Yokoyama, S.; Chattopadhyay, S.; Traon, Y.L. HInter: Exposing Hidden Intersectional Bias in Large Language Models. arXiv 2025, arXiv:2503.11962.
- Wang, Y.; Wang, X.; Beutel, A.; Prost, F.; Chen, J.; Chi, E.H. Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, Virtual, 14–18 August 2021; pp. 1748–1757.
- Lu, H.; Isonuma, M.; Mori, J.; Sakata, I. Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation. arXiv 2024, arXiv:2407.16951.
- Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 2021, 54, 115.
- Pessach, D.; Shmueli, E. A Review on Fairness in Machine Learning. ACM Comput. Surv. 2022, 55, 51.
- Ferrara, E. Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. Sci 2024, 6, 3.
- Nangia, N.; Vania, C.; Bhalerao, R.; Bowman, S.R. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1953–1967.
- Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; Chang, K.W. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 15–20.
- Dhamala, J.; Sun, T.; Kumar, V.; Krishna, S.; Pruksachatkun, Y.; Chang, K.W.; Gupta, R. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, Virtual, 3–10 March 2021; pp. 862–872.
- Parrish, A.; Chen, A.; Nangia, N.; Padmakumar, V.; Phang, J.; Thompson, J.; Htut, P.M.; Bowman, S. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2086–2105.
- Wang, S.; Cao, X.; Zhang, J.; Yuan, Z.; Shan, S.; Chen, X.; Gao, W. VLBiasBench: A comprehensive benchmark for evaluating bias in large vision-language models. arXiv 2024, arXiv:2406.14194.
- Nadeem, M.; Bethke, A.; Reddy, S. StereoSet: Measuring Stereotypical Bias in Pre-trained Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 5356–5371.
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
- Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A Survey of Machine Unlearning. ACM Trans. Intell. Syst. Technol. 2025, 16, 108.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
- Halevy, K.; Sotnikova, A.; AlKhamissi, B.; Montariol, S.; Bosselut, A. “Flex Tape Can’t Fix That”: Bias and Misinformation in Edited Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8690–8707.
- Zmigrod, R.; Mielke, S.; Wallach, H.; Cotterell, R. Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 1651–1661.
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022, arXiv:2212.08073.
- Leteno, T.; Gourru, A.; Laclau, C.; Gravier, C. An investigation of structures responsible for gender bias in BERT and DistilBERT. In Advances in Intelligent Data Analysis XXI, Proceedings of the International Symposium on Intelligent Data Analysis; Springer: Cham, Switzerland, 2023; pp. 249–261.
- Yan, S.; Kao, H.T.; Ferrara, E. Fair class balancing: Enhancing model fairness without observing sensitive attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 16–23 October 2020; pp. 1715–1724.
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and Editing Factual Associations in GPT. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 17359–17372.
- Meng, K.; Sharma, A.S.; Andonian, A.J.; Belinkov, Y.; Bau, D. Mass-Editing Memory in a Transformer. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
- Xu, X.; Xu, W.; Zhang, N.; McAuley, J. BiasEdit: Debiasing Stereotyped Language Models via Model Editing. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), Albuquerque, NM, USA, 3 May 2025; pp. 166–184.
- Gonen, H.; Goldberg, Y. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 609–614.
- Zhang, F.; Nanda, N. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
- Schick, T.; Udupa, S.; Schütze, H. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Trans. Assoc. Comput. Linguist. 2021, 9, 1408–1424.
- Caliskan, A.; Bryson, J.J.; Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 2017, 356, 183–186.
- Kirk, H.R.; Jun, Y.; Volpin, F.; Iqbal, H.; Benussi, E.; Dreyer, F.; Shtedritski, A.; Asano, Y. Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Adv. Neural Inf. Process. Syst. 2021, 34, 2611–2624.
- Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; et al. Toy Models of Superposition. arXiv 2022, arXiv:2209.10652.
- Amirizaniani, M.; Martin, E.; Roosta, T.; Chadha, A.; Shah, C. AuditLLM: A tool for auditing large language models using multiprobe approach. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 5174–5179.
- Qiu, P.; Zhou, S.; Ferrara, E. Information suppression in large language models: Auditing, quantifying, and characterizing censorship in DeepSeek. Inf. Sci. 2026, 724, 122702.
- Blodgett, S.L.; Lopez, G.; Olteanu, A.; Sim, R.; Wallach, H. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 1004–1015.
- Govil, P.; Jain, H.; Bonagiri, V.; Chadha, A.; Kumaraguru, P.; Gaur, M.; Dey, S. COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models. In Proceedings of the 17th ACM Web Science Conference 2025, Websci ’25, New Brunswick, NJ, USA, 20–24 May 2025; pp. 460–471.
- Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3356–3369.
| Family | Model | Parameters |
|---|---|---|
| Gemma | google/gemma-2b | 2B |
| Gemma | google/gemma-7b | 7B |
| OLMo | allenai/OLMo-1B-0724-hf | 1B |
| OLMo | allenai/OLMo-2-1124-7B | 7B |
| LLaMA | meta-llama/Llama-3.2-1B | 1B |
| LLaMA | meta-llama/Llama-2-7b-hf | 7B |
| Qwen | Qwen/Qwen2.5-3B-Instruct | 3B |
| GPT-Neo | EleutherAI/gpt-neo-1.3B | 1.3B |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 | 7B |
| Deepseek | deepseek-ai/deepseek-llm-7b-chat | 7B |