Article

No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 23 November 2025 / Revised: 24 December 2025 / Accepted: 8 January 2026 / Published: 13 January 2026

Abstract

Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful outputs. While various techniques aim to mitigate these biases, their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing) across ten models from seven families, measuring impact on racial, religious, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradations to model coherence and performance along debiasing objectives in 31.5% of untargeted dimension evaluations. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation in others. We introduce a multi-dimensional auditing framework and demonstrate that single-target evaluations mask potentially severe spillover effects, underscoring the need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
Keywords:
AI; LLMs; fairness; bias

1. Introduction

Large Language Models (LLMs) have emerged as a transformative and rapidly evolving technology, now embedded in a wide range of applications that influence how we access information, create content, and interact with the digital world. However, their increasing adoption is accompanied by a fundamental challenge: LLMs trained on large corpora of human-generated content inherit and frequently exacerbate deeply ingrained societal prejudices regarding race, gender, religion and other sensitive categories [1,2]. The risk of these models reinforcing harmful stereotypes is a critical barrier to their safe and fair adoption, making the development of effective bias mitigation techniques a central focus of AI research [3,4,5].
Numerous mitigation strategies have been proposed, ranging from data debiasing and constrained decoding to fine-tuning and parameter editing [6,7,8,9]. However, the evaluation of these techniques often suffers from a critical methodological flaw: a narrow, one-dimensional focus. For example, a technique designed to mitigate gender bias is typically judged a success if it reduces a gender bias metric, with little to no scrutiny of its potential for collateral damage or negative spillover into other bias domains. This oversight represents the specific research gap our work addresses. While the prior literature has theorized about such unintended consequences and described them as a “Ripple Effect Trap” [10] or a “Butterfly Effect” [11], there has been no large-scale, systematic quantification of these cross-dimensional trade-offs for modern LLMs.
This paper addresses this gap by introducing and implementing a comprehensive auditing framework designed to systematically uncover these effects. Our primary methodological innovation is the shift from a single-target evaluation to a multi-dimensional audit that treats any intervention as a systemic change to be evaluated in its entirety. Our framework is structured into four phases: (1) establishing a multi-dimensional performance baseline for a given model; (2) applying a single, targeted debiasing intervention for one specific dimension; (3) evaluating the intervened model’s performance not just on the target dimension but across all unmitigated “spillover” dimensions; and (4) analyzing the resulting data to quantify the trade-offs between on-target efficacy and collateral harm. This framework allows us to rigorously test our “No Free Lunch” hypothesis: that due to the entangled nature of conceptual representations within LLMs, targeted interventions on singular bias dimensions can cause unintended side effects on other, unmitigated bias dimensions.
To execute this audit, we conduct a large-scale empirical study with a carefully selected set of components. We analyze a diverse set of ten language models, including 1–7B models, as well as both foundational “base” models and “instruction-tuned” variants. We test four distinct post hoc debiasing techniques that represent different families of intervention: Prompt Debiasing (an input based method), Logit Steering and Activation Patching (in-flight geometric interventions on model activations) and BiasEdit (a direct parameter editing method). Our evaluation is grounded in the StereoSet benchmark, where we operationalize bias as “stereotypical preference” via its Stereotype Score (SS) and model quality as “linguistic coherence” via its Language Modeling Score (LMS). While this provides a robust and replicable measure, we acknowledge upfront that our conclusions are scoped to this specific operationalization of bias and may not capture all its forms.
Our contributions are as follows:
  • We conduct a comprehensive study of four post hoc debiasing techniques (Logit Steering, Activation Patching, BiasEdit and Prompt Debiasing) across ten language models, creating a robust and generalizable body of evidence.
  • We find consistent and statistically significant evidence for our “No Free Lunch” hypothesis. Targeted debiasing frequently causes significant collateral damage or “spillover” into untargeted dimensions, in some cases causing more harm than the original intervention sought to fix.
  • We present our methodology as a necessary framework (cf., Figure 1) for the responsible evaluation of bias mitigation techniques, advocating for the adoption of multi-dimensional analysis as a standard practice in the field.

2. Related Work

2.1. Trade-Offs in Algorithmic Success and AI

Bias mitigation efforts in LLMs often involve targeted interventions designed to reduce specific undesirable behaviors in models. While these interventions can be effective on their intended objectives, they can propagate in complex and often unpredictable ways to produce unintended effects elsewhere. We formalize this idea by positing the problem of the “Butterfly Effect” in AI: small, targeted interventions can trigger cascading and unpredictable consequences in the broader system’s behavior [11].
The presence of such cascading effects points to a broader concern about trade-offs in complex systems: improvements in one aspect of a system often come at a cost to others. This intuition resonates with the “No Free Lunch” (NFL) theorem for optimization originally proposed by Wolpert and Macready in 1997 [12], which presents a formal analysis of how algorithms perform across different problem classes. The theorem states that, for a given search or optimization algorithm, any gains in performance on one class of problems are necessarily offset by losses in performance on another class of problems; in other words, there is no universally optimal algorithm when performance is averaged across all possible problem distributions. As NFL is grounded in optimization theory, its formal assumptions about problem classes and algorithmic analysis across distributions do not apply directly to bias mitigation in LLMs; yet, the theorem provides a useful conceptual lens for investigating and reasoning about trade-offs in complex systems. We thus draw on this idea to analyze whether trade-offs analogous to those presented by NFL arise empirically when bias mitigation is applied across multiple social dimensions in LLMs, and we investigate whether successful reductions in bias along one dimension are offset by increases in bias or similar degradations along other dimensions and metrics.

2.2. Cross-Dimensional Effects and Fairness Trade-Offs

The concept of trade-offs naturally extends from broad algorithmic contexts to specific applications in fairness and bias mitigation. Fairness literature has long established that satisfying multiple fairness criteria is often mathematically impossible [13]. Kearns et al. [14] show that ensuring fairness for independent demographic groups does not guarantee fairness for their intersections. Similarly, studies of intersectional bias [15,16,17,18] highlight that biases do not exist independently of each other: certain combinations of demographic attributes produce unique stereotypical associations that are not captured when each dimension is considered in isolation. Extending the idea of trade-offs, Wang et al. [19] observe in multi-task learning that optimizing for fairness or accuracy on one task can degrade performance on other tasks, illustrating that multi-dimensional trade-offs are a general property of complex predictive systems. This body of work underscores both the complexity of bias and the inherent challenges in addressing it.
The study of trade-offs specifically in the domain of LLM fairness is relatively underexplored, especially when discussing how bias dimensions intersect and interact. While some work, such as Lu et al. [20], observes effects of debiasing one dimension on others, these analyses are limited in scope and do not systematically measure cross-dimensional spillover across multiple bias categories. In most cases, existing evaluations focus on a single targeted bias or overall model performance, leaving the broader landscape of cross-category interactions largely unexamined. Our work addresses this gap by introducing a replicable framework for systematically auditing debiasing techniques in LLMs across models, dimensions, and metrics. We consider multiple axes of bias to better understand their interactions, demonstrating that interventions along one dimension often produce unexpected effects on others, echoing the spirit of trade-offs emphasized by NFL.

2.3. Bias Benchmarks

Various recent surveys [21,22,23] offer extensive overviews of the different sources of bias, starting from historical representation in training data to algorithmic processing and the different mathematical definitions of fairness.
To quantify these biases, a variety of benchmarks have been developed. For example, datasets like CrowS-Pairs [24] and WinoBias [25] measure bias through paired sentences that differ only by a demographic term; the BOLD dataset [26] evaluates bias in open-ended text generation across a vast number of prompts. Bias evaluation has also been extended into other realms such as question answering [27] and Vision Language Models [28].
Our work adopts the StereoSet benchmark [29], which is uniquely suited to our research goals. Unlike binary-choice datasets, StereoSet presents example contexts paired with triplets of sentences (stereotype, anti-stereotype, unrelated), which allows for the disentanglement of a model’s linguistic coherence from its stereotypical behaviors. This is critical given the documented trade-offs between fairness and model performance in complex learning scenarios [19]. Furthermore, it has been shown that catastrophic forgetting [30] is a significant challenge for neural networks and LLMs in both learning and unlearning tasks [31,32,33]; thus, measuring how debiasing affects model coherence is essential. Additionally, StereoSet’s multi-dimensional nature, covering race, gender, religion, and profession, is a prerequisite for our investigation into the cross-dimensional effects of bias mitigation.

2.4. Existing Mitigation Techniques

A significant body of work has focused on mitigating bias during the model’s initial training or a subsequent full fine-tuning phase. These methods aim to embed fairness more fundamentally into the model’s parameters. Techniques include data augmentation with counterfactual examples [34], re-weighting training examples to reduce the influence of biased data, and resource-intensive methods like Reinforcement Learning from Human Feedback (RLHF) to steer models toward less harmful behavior [32,35]. Architectural analyses, as performed by Leteno et al. [36], try to identify the specific components responsible for encoding bias and give insights that can inform future model design. The Fair Class Balancing technique by Yan and collaborators [37] demonstrates a method for rebalancing training data not on the sensitive attributes themselves, but on automatically discovered proxy attributes, hence improving group fairness. While potentially more robust, these methods are highly computationally expensive, require access to large datasets, and involve a full training pipeline.
A promising middle ground between full retraining and pure inference-time methods is the field of model editing. These techniques make surgical, computationally efficient modifications to the weights of a pre-trained model to alter a specific behavior. Techniques like ROME [38], MEMIT [39] and BiasEdit [40] fall under this category. They are a lighter-weight alternative to full fine-tuning, but still require direct access to the model’s parameters.
In contrast to training-based methods, inference-based (or post hoc) techniques are computationally cheap and model-agnostic. The foundational idea of representing social bias as a linear direction in an embedding space was introduced by Bolukbasi et al. [6] for static word embeddings. The authors demonstrated that certain biases, such as gender, could be identified via PCA on the difference vectors of definitional pairs (e.g., “he” vs. “she”) and subsequently removed via geometric projection. The Logit Steering and Activation Patching techniques are direct applications of this projection method to the hidden states of modern Transformer models. However, Gonen and Goldberg [41] find that while these techniques may successfully remove the projection of bias, they tend to leave the clustering of biased concepts intact in the vector space. Thus, debiasing may often operate at a superficial level and, while effective on the surface, may fail to eliminate the underlying structures responsible for bias emergence.
While the aforementioned techniques have been shown to be generally effective at reducing bias on their target dimension, their evaluation on untargeted dimensions is often overlooked. Most studies measure the reduction of a specific bias and may track its effect on overall model capabilities like perplexity. However, the potential for collateral damage where an intervention on one bias axis inadvertently introduces or exacerbates bias on another is a critical but underexplored area. This paper directly addresses this gap. By applying a suite of interventions and measuring their effects across all four StereoSet dimensions, we provide a complete analysis of their true costs, suggesting trade-off dynamics consistent with the spirit of the “No Free Lunch” principle.

3. Methodology

3.1. Dataset: StereoSet

To evaluate model bias across multiple domains, we utilize the StereoSet dataset [29], a widely recognized benchmark designed specifically to measure stereotypical biases in pre-trained language models. While more recent benchmarks exist, StereoSet’s structure is uniquely suited for our research. Its inclusion of an “unrelated” completion allows for the direct measurement of a model’s linguistic coherence (via the LMS), which is a critical component for testing our hypothesis that debiasing can harm a model’s fundamental capabilities. Other benchmarks that focus solely on stereotype preference would not allow us to disentangle these effects.
For our experiments, we use the intersentence portion of StereoSet data and the benchmark’s accompanying metrics to probe models’ coherence and preferences for stereotypical sentences.

3.1.1. Scope and Dimensions

The StereoSet dataset is organized into four distinct domains of social bias, allowing us to investigate the central thesis of our paper: that mitigating bias in one dimension can exacerbate it in others. The validation set used in our evaluation contains a total of 2123 examples, distributed across four dimensions (race, gender, religion, profession). More specifically, after we apply very minor filtering, there are 242, 827, 976, and 78 gender, profession, race, and religion intersentence entries, respectively. We acknowledge the significant imbalance in the number of examples per dimension, particularly the small size of the religion set. This disparity means that metrics calculated for religion are likely to be less stable and have wider confidence intervals than those for race. Our analysis accounts for this by relying on aggregated trends and statistical significance tests across all models, rather than drawing strong conclusions from any single data point from the smaller dimensions.

3.1.2. Structure of an Intersentence StereoSet Entry

Each data point consists of a context sentence and a set of three carefully constructed stereotype, anti-stereotype, and unrelated completions. The stereotype completions are designed to reflect existing social stereotypes while the anti-stereotype completions subvert these biases. The unrelated completions are included to evaluate the linguistic coherence of models.
An example entry taken from the StereoSet data’s race category is as follows:
  • Context: “My neighbor is Hispanic.”
  • Stereotype: “He doesn’t speak English.”
  • Anti-stereotype: “He went to college and is a teacher.”
  • Unrelated: “Dogs have funny tails.”
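For concreteness, the sketch below shows how one such intersentence entry can be represented in code during evaluation; the field names are illustrative simplifications rather than StereoSet’s exact JSON schema.

```python
# Illustrative in-memory representation of one intersentence StereoSet entry.
# Field names are simplified for exposition and do not mirror the released JSON schema.
entry = {
    "bias_type": "race",
    "context": "My neighbor is Hispanic.",
    "completions": {
        "stereotype": "He doesn't speak English.",
        "anti_stereotype": "He went to college and is a teacher.",
        "unrelated": "Dogs have funny tails.",
    },
}
```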

3.1.3. Evaluation Metrics

The Language Modeling Score (LMS) serves as a measure of a model’s fundamental linguistic competence. In the context of intersentence StereoSet data, LMS evaluates a model’s ability to distinguish between meaningful and unrelated sentence completions. For each example, a model is considered successful if it assigns a higher probability to either the stereotypical or anti-stereotypical completion than to the unrelated completion. The final score is the percentage of examples where this condition is met.
The LMS is calculated as:
\mathrm{LMS} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{I}\left( \max\left(P_{\mathrm{stereo},i},\, P_{\mathrm{anti\text{-}stereo},i}\right) > P_{\mathrm{unrelated},i} \right)
where N is the total number of examples and I is the indicator function. For each example i, P_stereo,i, P_anti-stereo,i, and P_unrelated,i denote the model-assigned probabilities of the stereotypical, anti-stereotypical, and unrelated completions, respectively. A higher LMS indicates a more coherent model that better understands the context, and the LMS of an ideal model is 100.
The next evaluation metric is the Stereotype Score (SS), which directly measures the model’s bias by quantifying its preference for stereotypical associations. It is calculated as the percentage of examples in which the model assigns a higher probability to the stereotypical completion over the anti-stereotypical one.
The SS is calculated as:
\mathrm{SS} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{I}\left( P_{\mathrm{stereo},i} > P_{\mathrm{anti\text{-}stereo},i} \right)
A score of 100 indicates a complete preference for stereotypical associations, while a score of 0 indicates a complete preference for anti-stereotypical ones. An ideally unbiased model would demonstrate no preference, yielding an SS of 50.
To provide a single, holistic measure that balances linguistic competence with fairness, we use the Idealized Context Association Test (ICAT) score. The ICAT score combines LMS and SS, rewarding models that are both knowledgeable (high LMS) and unbiased (SS close to 50). The score is formulated to penalize models that are biased in either the stereotypical or anti-stereotypical direction through its fairness component, \min(\mathrm{SS}, 100 - \mathrm{SS}) / 50. This term is maximized at 1 when SS is 50 and drops to 0 when SS is either 0 or 100.
The ICAT score is calculated as:
\mathrm{ICAT} = \mathrm{LMS} \times \frac{\min(\mathrm{SS},\, 100 - \mathrm{SS})}{50}
The ICAT score ranges from 0 to 100 and satisfies several desirable axioms:
  • An ideal model with perfect coherence (LMS = 100) and no bias (SS = 50) achieves an ICAT score of 100.
  • A fully biased model (SS = 0 or SS = 100) achieves an ICAT score of 0, regardless of its LMS.
  • A random-guess model (LMS = 50, SS = 50) achieves an ICAT score of 50.
For the purpose of this evaluation, a model’s “preference” for a given completion is defined by the probability it assigns to that completion. A higher probability indicates a stronger preference. Our evaluation relies on comparing the model’s assigned probabilities to these three completions to determine its underlying biases and linguistic capabilities.
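The following sketch, written in Python, illustrates how LMS, SS, and ICAT can be computed from per-example completion probabilities as defined above; the function name and the use of raw probabilities (rather than, e.g., length-normalized log-likelihoods) are simplifying assumptions for illustration.

```python
import numpy as np

def stereoset_metrics(p_stereo, p_anti, p_unrelated):
    """Compute LMS, SS, and ICAT from arrays of per-example completion probabilities."""
    p_stereo, p_anti, p_unrelated = map(np.asarray, (p_stereo, p_anti, p_unrelated))
    # LMS: share of examples where a meaningful completion outranks the unrelated one.
    lms = 100.0 * np.mean(np.maximum(p_stereo, p_anti) > p_unrelated)
    # SS: share of examples where the stereotypical completion outranks the anti-stereotypical one.
    ss = 100.0 * np.mean(p_stereo > p_anti)
    # ICAT: coherence scaled by how close SS is to the unbiased value of 50.
    icat = lms * min(ss, 100.0 - ss) / 50.0
    return lms, ss, icat

# Sanity check: a perfectly coherent, unbiased model attains LMS = 100, SS = 50, ICAT = 100.
```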

3.2. Models

In order to comprehensively evaluate bias and mitigation techniques, we conduct experiments across a diverse set of transformer-based LLMs. This range allows us to observe how model features influence baseline biases and to quantify the efficacy of bias mitigation methods. The models used in our study are summarized in Table 1.
These models were chosen to represent a broad spectrum of model characteristics including size and architecture. The selection enables us to determine whether bias mitigation techniques are more effective for certain models, and to analyze how cross-category bias spillover manifests across different LLM families.

3.3. Bias Mitigation Techniques

To investigate the trade-offs of targeted debiasing, we implement four distinct techniques. While all are applied post hoc without full retraining, they represent three different families of intervention: Geometric Interventions that manipulate activations in-flight, Model Editing Interventions that make surgical modifications to model weights, and Input-Based Interventions that modify the prompt.

3.3.1. Bias Direction Computation via PCA

To perform targeted interventions, we must first represent an abstract bias concept as a concrete direction in the model’s activation space. We adopt the methodology pioneered by Bolukbasi and collaborators [6] for word embeddings and adapt it for contextual language models.
We begin by selecting contrastive pairs for each bias dimension that represent the poles of the bias axis (e.g., (“He is”, “She is”) for gender, (“Black person”, “White person”) for race). Each text in a pair is fed through the model, and we extract the final-layer hidden state representations. The hidden state for each text is averaged across all token positions to produce a single vector.
For each pair, we compute the difference between the two resulting vectors (e.g., h_{"He is"} − h_{"She is"}). This creates a set of difference vectors, each pointing along a slightly different instantiation of the bias axis. To find the single, most dominant direction of variance across all difference vectors, we perform Principal Component Analysis (PCA) and extract the first principal component.
The resulting vector is normalized to unit length, giving a pure directional vector v_bias that represents the core axis of the targeted bias within the model’s activation space. The computed bias vector v_bias serves as the basis for Logit Steering and Activation Patching.
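A minimal sketch of this procedure is shown below, assuming a Hugging Face Transformers model and mean pooling over the final-layer hidden states; the helper name bias_direction and the choice of pooling are illustrative assumptions rather than a prescription of our exact implementation.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

def bias_direction(model_name, pairs, device="cpu"):
    """Estimate a unit-norm bias direction from contrastive text pairs via PCA.

    `pairs` is a list of (pole_a, pole_b) strings, e.g. [("He is", "She is"), ...].
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    def embed(text):
        inputs = tok(text, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
        return hidden.mean(dim=1).squeeze(0).cpu().numpy()  # mean-pool over token positions

    # One difference vector per contrastive pair, each pointing along the bias axis.
    diffs = [embed(a) - embed(b) for a, b in pairs]
    v = PCA(n_components=1).fit(diffs).components_[0]  # dominant direction of variance
    return v / (np.linalg.norm(v) + 1e-12)             # normalize to unit length
```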

3.3.2. Geometric Interventions

The following bias mitigation techniques operate by geometrically projecting out the pre-computed bias direction from the model’s hidden state activations during the forward pass.
Logit Steering (Projection-Based Debiasing)
Logit Steering is an inference-time intervention implemented via a forward hook. Following common practice in the literature, we attach the hook to the penultimate layer of the model, as this is the final representation layer that directly influences the output logits and is therefore a high-leverage point for intervention. The hyperparameter α controls the strength of the intervention. For all our experiments, we use α = 1.0, representing a full removal of the projected bias component. This choice was made not to find an optimal balance, but to apply a strong and consistent intervention across all models to maximally probe for the existence and magnitude of potential side-effects.
During the forward pass, for each hidden state vector h produced by this layer, we perform a linear projection to remove the component that aligns with the bias direction:
h_{\mathrm{debiased}} = h - \alpha \cdot \mathrm{proj}_{v_{\mathrm{bias}}}(h)
where proj_{v_bias}(h) is the projection of h onto the bias vector v_bias.
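A sketch of this intervention as a PyTorch forward hook is shown below; the handling of tuple outputs and the penultimate-layer attribute path in the usage comment are assumptions that vary across model families.

```python
import torch

def make_projection_hook(v_bias: torch.Tensor, alpha: float = 1.0):
    """Create a forward hook that removes the component of hidden states along v_bias."""
    v_unit = v_bias / v_bias.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output   # (batch, seq, d_model)
        v = v_unit.to(h.device, h.dtype)
        proj = (h @ v).unsqueeze(-1) * v                          # projection of h onto v_bias
        h_debiased = h - alpha * proj                             # h - alpha * proj_{v_bias}(h)
        return (h_debiased, *output[1:]) if isinstance(output, tuple) else h_debiased

    return hook

# Usage sketch (the layer attribute path depends on the architecture):
# handle = model.model.layers[-2].register_forward_hook(make_projection_hook(v_bias, alpha=1.0))
# ... run the StereoSet evaluation ...
# handle.remove()
```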
Activation Patching
Activation Patching is a more aggressive form of projection-based debiasing, repurposing a technique from the field of mechanistic interpretability [38]. While the mathematical operation is identical to Logit Steering, the key difference is its scope [42].
Based on the hypothesis that bias is a distributed representation across multiple layers, this technique applies the same “projection and subtraction” intervention simultaneously to several layers of the model. Drawing from interpretability research suggesting that complex concepts are processed across the final layers of a network, we apply the hook to the final five layers of each model. This constitutes a more forceful patch intended to test whether a broader intervention can more thoroughly remove the bias signal as it propagates through the network. As with Logit Steering, we use an intervention strength of α = 1.0.
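Reusing the hypothetical hook factory sketched above for Logit Steering, Activation Patching amounts to registering the same projection hook on the last five decoder layers; the attribute path model.model.layers is an assumption that differs across model families.

```python
# Activation Patching sketch: the same projection is applied to the final five layers.
handles = [
    layer.register_forward_hook(make_projection_hook(v_bias, alpha=1.0))
    for layer in model.model.layers[-5:]
]
# ... evaluate on StereoSet with the patch active ...
for handle in handles:
    handle.remove()  # restore the original forward pass
```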

3.3.3. Prompt Debiasing (Input Based Intervention)

Unlike the geometric interventions, Prompt Debiasing is an input-based technique that uses the instruction-following capabilities of modern LLMs [35,43]. This method involves prepending a carefully crafted instructional prompt to the original StereoSet context before it is processed by the model.
The prompts explicitly instruct the model to avoid the targeted bias and provide examples of neutral behavior. For example, the prompt for the gender dimension begins:
“Please respond without gender stereotypes or bias. Treat all genders equally. Examples of unbiased statement: -An engineer could be any gender……”
A critical aspect of our methodology is the evaluation of this technique. To ensure we are only measuring the model’s performance on the actual task completion, the tokens corresponding to our instructional prompt are masked and excluded from the Negative Log-Likelihood (NLL) calculation. This isolates the effect of the conditioning prompt on the model’s output quality.
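The sketch below illustrates this masking step for a single completion, assuming a Hugging Face causal language model whose loss ignores label positions set to -100; the helper name and the whitespace handling are illustrative assumptions.

```python
import torch

def masked_completion_nll(model, tokenizer, debias_prompt, context, completion, device="cpu"):
    """Mean NLL of `completion` given `debias_prompt + context`, scoring only completion tokens."""
    prefix_ids = tokenizer(debias_prompt + " " + context, return_tensors="pt").input_ids
    completion_ids = tokenizer(" " + completion, return_tensors="pt",
                               add_special_tokens=False).input_ids
    input_ids = torch.cat([prefix_ids, completion_ids], dim=1).to(device)

    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # exclude prompt and context tokens from the loss

    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()  # cross-entropy averaged over completion tokens only
```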

3.3.4. BiasEdit (Parameter Editing)

BiasEdit is a targeted model editing approach that modifies a small subset of a model’s parameters to reduce stereotypical bias while preserving overall language modeling performance [40]. The method has been shown to successfully reduce racial, religious, and gender-related biases in transformer-based LLMs while minimally affecting downstream task performance, but there is no investigation of its cross-dimension effects.
The technique works by employing lightweight editor networks that generate parameter updates for specific model components. Based on preliminary bias tracing experiments, Xu et al. [40] conclude that stereotypical associations tend to be concentrated in the MLP layers of transformer blocks with co-occurrences being captured in lower layers. Additional results determine which specific layers are optimal for debiasing. For consistency, we implement the technique on the penultimate layer of each model to balance intervention effectiveness with minimal disruption to overall model performance.
To train the editor networks, we utilize the same StereoSet examples as in other methods with an 8:1 train-dev split for each dimension. The editing process is guided by two loss functions: a symmetric debiasing loss that encourages models to assign equal probability to StereoSet’s stereotypical and anti-stereotypical completions, and a retention loss that preserves language modeling capabilities by attempting to maintain predictions on neutral completions. Thus, critically, BiasEdit’s goal is not simply to reduce stereotypical bias within models, but to achieve equal distributions between stereotypical and anti-stereotypical predictions while maintaining coherence. While the approach is defined for intrasentence data, we adapt it to handle StereoSet’s intersentence examples to reflect our goal of evaluating and understanding the manifestation of bias across complex contexts. This process yields a model specifically adapted based on anti-stereotypes from a single bias dimension.
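The sketch below gives a simplified reading of these two objectives in terms of per-example sentence log-probabilities; it is an illustration of the loss structure described above, not the authors’ exact formulation or editor-network architecture.

```python
import torch
import torch.nn.functional as F

def biasedit_objective(logp_stereo_edit, logp_anti_edit, logp_neutral_edit, logp_neutral_base,
                       retention_weight=1.0):
    """Simplified BiasEdit-style training objective for one batch of StereoSet examples."""
    # Symmetric debiasing loss: push the edited model toward assigning equal
    # probability to stereotypical and anti-stereotypical completions.
    debias_loss = (logp_stereo_edit - logp_anti_edit).pow(2).mean()
    # Retention loss: keep the edited model's scores on neutral completions close
    # to those of the unedited model, preserving language modeling ability.
    retention_loss = F.mse_loss(logp_neutral_edit, logp_neutral_base)
    return debias_loss + retention_weight * retention_loss
```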

4. Auditing Framework

4.1. Stage 1: Baseline Performance Calculation

The initial and most important phase is the establishment of a performance baseline for each model. This provides the reference point against which all changes are measured. The pre-trained language model is loaded and run on the StereoSet dataset without any debiasing interventions active. The evaluation is performed independently for each of the four bias dimensions. The raw LMS, SS and ICAT scores for each dimension are calculated and saved.

4.2. Stage 2: Intervention Application and Evaluation

For the geometric techniques, the bias direction vector ( v b i a s ) for the target dimension is computed using PCA as described in Section 3.3.1. For BiasEdit, the necessary weight modifications are calculated. Next, the specific debiasing technique is activated. For Logit Steering and Activation Patching, the appropriate forward hooks are registered on the model’s layers. For BiasEdit, the pre-calculated weight changes are applied to the model. For Prompt Debiasing, the relevant instructional prompt is prepared for prepending to the input.

4.3. Stage 3: Multi-Dimensional Evaluation

With the intervention active for the chosen target dimension, the model is evaluated on the StereoSet benchmark across all four evaluation dimensions using LMS, SS, and ICAT. This process is designed to capture not only the intended effects but also the unintended collateral damage central to our “No Free Lunch” thesis.
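The pseudocode-style sketch below summarizes the four stages as a single audit loop; evaluate, technique.apply, and technique.remove are hypothetical abstractions standing in for the StereoSet scoring and the hook registration, weight editing, or prompt prepending described above.

```python
DIMENSIONS = ("gender", "profession", "race", "religion")
METRICS = ("LMS", "SS", "ICAT")

def audit(model, techniques, evaluate):
    """Run the multi-dimensional audit and return per-cell metric changes vs. baseline."""
    baseline = {dim: evaluate(model, dim) for dim in DIMENSIONS}           # Stage 1
    deltas = {}
    for technique in techniques:
        for target in DIMENSIONS:
            state = technique.apply(model, target)                         # Stage 2
            scores = {dim: evaluate(model, dim) for dim in DIMENSIONS}     # Stage 3
            technique.remove(model, state)
            deltas[(technique.name, target)] = {                           # Stage 4
                dim: {m: scores[dim][m] - baseline[dim][m] for m in METRICS}
                for dim in DIMENSIONS
            }
    return deltas
```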

5. Results

We now present an empirical analysis of targeted bias mitigation in LLMs, focusing on its effectiveness on both intended and untargeted dimensions. Our analysis addresses three key questions: (1) Does targeted debiasing succeed on its intended dimension? (2) What collateral effects occur on unmitigated dimensions? (3) How do these patterns vary by technique, model, and dimension?
We conducted 160 unique debiasing experiments, evaluating the language models across 4 techniques and 4 target dimensions. Each experiment was audited by measuring its impact across all 4 fairness dimensions, resulting in 640 total evaluations. We define “spillover” as a change in any of the three metrics (LMS, SS, ICAT) on a dimension which is different from the targeted dimension for mitigation. Additionally, we define “harm” specifically as a degradation in model behavior, operationalized as an increase in SS or a decrease in LMS or ICAT. All reported changes in metrics are measured relative to the original, unmodified model.
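For illustration, these operational definitions can be expressed as a small helper over the per-cell metric changes produced by the audit; thresholding at zero, without a significance test, is a simplification.

```python
def classify_cell(delta, target_dim, eval_dim):
    """Label one (target, evaluation) cell given metric changes relative to the unmodified model.

    `delta` is e.g. {"LMS": -1.2, "SS": 0.8, "ICAT": -2.0}.
    """
    spillover = eval_dim != target_dim                                # change on an untargeted dimension
    harm = delta["SS"] > 0 or delta["LMS"] < 0 or delta["ICAT"] < 0   # degradation in behavior
    return {"spillover": spillover, "harm": harm}
```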
Using the ICAT score as our measure of a model’s overall utility, our results reveal that targeted interventions achieved a statistically significant improvement in the on-target ICAT score in only 20.6% of cases. Conversely, these same interventions caused statistically significant collateral damage, worsening the ICAT score on unmitigated, spillover dimensions in 31.5% of all spillover evaluations. In summary, our results indicate that, within the scope of our investigation, unintended cross-dimensional degradations in performance are more frequent than successful improvements on the targeted dimension.

5.1. Systemic Trade-Offs in Bias Mitigation

Our primary finding is that bias mitigation is not a localized fix but a systemic intervention with far-reaching consequences. The Heatmap in Figure 2 summarizes this phenomenon by showing the average change in the model’s overall utility (ICAT score) for every target–evaluation pair, illustrating how interventions propagate across dimensions.
The results are striking and reveal the potential for a pattern of systemic harm. The most dominant feature is the prevalence of negative (blue) values, indicating that these interventions, on average, damage the model’s overall quality. This is true not only for off-target “collateral damage” but for the on-target intervention itself.
For example, consider the case where profession is both the target and evaluation dimension, which shows a statistically significant average ICAT change of −3.16 (t(39) = −2.22, p < 0.05). This means that the techniques applied to “fix” profession bias were so harmful to the model’s core linguistic capabilities that they made the model significantly worse at handling the topic of professions: in essence, the cure was worse than the disease. Similarly, targeting race bias led to an average on-target ICAT change of −1.70 while also causing significant collateral damage to the model’s performance on profession, with a change of −3.20 (t(39) = −2.28, p < 0.05).
The Scatter Plot in Figure 3 confirms this is not an artifact of averaging. Each point represents a single debiasing run, with the x-axis indicating the change in SS on the targeted dimension and the y-axis indicating the corresponding change on non-targeted dimensions. While some interventions achieve simultaneous improvements on both axes, a large number (35 data points) fall into the quadrant characterized by improved on-target performance alongside degraded off-target performance. This distribution indicates that improvements on one dimension are frequently associated with adverse changes on others, consistent with the broader intuition underlying NFL-style trade-offs in complex systems, rather than representing isolated, independent gains.

5.2. Dimension-Specific Debiasing Success

Our analysis reveals substantial variation in debiasing success across the four dimensions: some dimensions proved quite amenable to intervention, while others saw significant increases in bias levels, as displayed in Figure 4.
Religion emerged as the most spillover-susceptible evaluation dimension, exhibiting both the top beneficial and top adverse spillovers; when race was the target dimension, one experimental run resulted in the largest reduction in SS of nearly 14%, while another experimental run resulted in an increase in SS of over 23%. One potential explanation for this volatility is that religion could be highly entangled with other dimensions of bias and that models may lack the capability to representationally distinguish racial bias, for example, from religious bias.
Gender as an evaluation dimension follows closely behind in terms of this pattern. Another potential explanation relates to the imbalance in dimensions across StereoSet data; since gender and religion examples comprise less than 12% and 4% of the dataset respectively, all metrics may become more sensitive to small changes thus amplifying observed spillovers. This highlights the importance of balance in dimensional composition in future efforts to create bias benchmarks.
The beneficial spillovers warrant extra scrutiny as successful reduction of SS does not require that a model maintains its coherence. The top three most beneficial cross-category spillovers were the result of applying the BiasEdit technique. In two of these runs, applying the technique increased LMS, but in the last one, LMS decreased by more than 20%. This shows that mitigating bias along one dimension can result in significant, unintended consequences unrelated to the main goal of debiasing—both good and bad—along other dimensions. This is consistent with the catastrophic forgetting notion discussed by Kirkpatrick et al. [30]. Furthermore, our results contradict those presented by Lu et al. [20] which suggest that spillover from mitigation is usually beneficial; the largest increase in SS from our experiments is nearly double the largest decrease in SS, and nearly half of the experimental runs result in decreases in SS in untargeted dimensions. These results, in conjunction with those presented in Figure 2, reveal an overall susceptibility within our tested models and techniques to adverse spillover effects.
The asymmetric pattern of spillovers suggests that bias mitigation techniques seeking to reduce bias along one dimension at a time may be insufficient. Real-world biases are complex and often represented intersectionally [15,16] in LLMs [17,18]. Thus, future work in fairness must address these concerns to accurately represent real-world biases. A potential avenue for exploration to reduce spillover effects is debiasing models sequentially so that dimensions are addressed in order of their independence of other dimensions. Addressing the issue of cross-dimension spillover is critical to ensure progress toward fairer LLMs.

5.3. Analysis of Bias Mitigation by Technique

In addition to examining the cross-dimension spillover effects, we analyze each technique’s success in mitigating bias along intended dimensions. A successful experimental run is defined here as a reduction in SS. Additionally, Figure 5 displays the distributions of change in SS across models for target dimension reduction.
BiasEdit was overall the most successful debiasing technique, reducing SS along intended dimensions in 72.5% of experimental runs. Nonetheless, Figure 5 shows that BiasEdit also displayed the largest range in SS change by far, suggesting that, while the technique may be successful in reducing bias in many cases, its efficacy is highly model- and dimension-dependent. The high variability echoes broader concerns about the brittleness of parameter editing techniques [33], which may overfit to specific patterns in limited training data. Additionally, our results contrast with those originally reported in [40], where the method was applied to intrasentence data, indicating that intersentence complexity is also a significant factor in the technique’s variability. Therefore, it is critical to develop debiasing methods that support intersentence data to reflect real-world language and biases more faithfully.
Logit Steering was the least successful method overall, reducing SS in only 35% of runs. This suggests its intervention is often too weak to overcome the model’s pre-existing biases. Activation Patching and Prompt Debiasing occupy a middle ground, succeeding 42.5% and 45.0% of the time, respectively, with both contributing to modest average decreases in SS.

5.4. Model Analysis

To tie our analyses together, we examine how debiasing interacts with model architecture. Figure 6 displays the changes in SS and LMS averaged across technique and dimension types. The figure also makes clear that editing resulted in changes in bias and model coherence that varied substantially between models.
Generally, models with fewer parameters display larger drops in LMS, indicating that smaller models were much more susceptible to losses in coherence resulting from intervention. This is likely because smaller models rely more heavily on compact, intertwined representations of language, meaning that any slight perturbation—including debiasing along a singular dimension—can be highly damaging to the internal structure responsible for general language modeling capabilities. Decreases in LMS occurred in seven out of the ten models after debiasing was applied.
Both Gemma-2b and DeepSeek-7b displayed increases in SS after debiasing. These models may encode biases in ways that are inaccessible to our debiasing techniques. Qwen-3B presents a puzzling case: its LMS decreased after intervention while its SS increased, implying that the model became both more biased and less coherent overall. This behavior underscores how limited our understanding of internal bias representations in LLMs remains, and further work is needed to thoroughly assess how complex biases manifest across varying model architectures.

5.5. Statistical Significance Testing

To distinguish true effects from statistical noise, we employ two primary statistical tests. When reporting the significance of an individual experimental run (a single ICAT_diff), the p-value is derived from a simulated independent-samples t-test comparing the distribution of scores from the baseline model to that of the intervened model, as described in our methodology. When analyzing the significance of an average effect across all models and techniques (e.g., the average ICAT_diff in a heatmap cell), we use a one-sample t-test. This test determines whether the mean of our sample of ICAT_diff values for a given condition is significantly different from zero, which represents the null hypothesis of “no effect”. Throughout our analysis, we use a standard significance threshold of p < 0.05.
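The sketch below shows both tests with SciPy on hypothetical numbers; the placeholder arrays are purely illustrative and do not correspond to any reported result.

```python
import numpy as np
from scipy import stats

# One-sample t-test: is the mean ICAT_diff for one heatmap cell different from zero?
icat_diffs = np.array([-3.1, -0.4, -2.8, 1.2, -4.0])       # hypothetical per-run values
t_cell, p_cell = stats.ttest_1samp(icat_diffs, popmean=0.0)

# Independent-samples t-test: baseline vs. intervened score distributions for one run.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(70, 5, size=200)               # placeholder samples
intervened_scores = rng.normal(68, 5, size=200)
t_run, p_run = stats.ttest_ind(baseline_scores, intervened_scores)

significant = p_cell < 0.05  # significance threshold used throughout the analysis
```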

6. Discussion

Our central finding that targeted interventions frequently cause harm to unmitigated dimensions can be explained by the entangled nature of conceptual representations within LLMs [44]. Our results strongly suggest that a model does not learn “gender”, “race”, etc., as discrete, orthogonal concepts. Instead, these are overlapping, co-dependent subspaces learned from a training corpus where they are deeply linked.
The vulnerability of the religion dimension to spillover is a prime example. In Western-based training data, discussions of religion are intertwined with gender roles, ethnic identities and specific professions [45]. Consequently, when an intervention forcefully alters the model’s representation of gender or race, it is not adjusting an isolated variable but disturbing a thread that runs through many other concepts. The resulting collateral damage is not a bug but an emergent feature of the entangled knowledge.
Beyond this conceptual description, our findings hint at a more technical, mechanistic explanation. The consistent, systemic drop in the Language Model Score (LMS) across nearly all interventions suggests that these techniques are causing a form of localized catastrophic forgetting. A post hoc intervention, particularly a powerful one like BiasEdit, may be analogous to a targeted adversarial attack on a specific conceptual subspace. By aggressively shifting the parameters or activations responsible for one concept, the intervention may corrupt the weights and representations that are shared with other, unrelated concepts due to principles like superposition [46]. This forces the model into an out-of-distribution state, degrading its ability to calculate coherent probabilities and thus harming its fundamental linguistic capabilities. The collateral damage is therefore not just a “swapping” of biases, but a degradation of the model’s core competence.
Our results show that evaluating a debiasing technique solely on its intended target is insufficient and misleading, since an intervention might appear successful while silently amplifying other harms. New auditing techniques for LLMs are emerging [47,48]: we argue that the use of a multi-dimensional auditing framework such as the one proposed in this paper should become standard practice. Before deploying any bias mitigation technique, practitioners must perform a comprehensive evaluation to map its full impact, measuring not only the intended effects but also the unintended spillover. Practitioners must move beyond asking “Did we fix the problem?” and instead ask “What was the total systemic impact of our fix?”.
Finally, we must acknowledge the limitations of our study. Our results and analysis are contingent on the StereoSet benchmark which, as pointed out by Blodgett and collaborators [49], has known limitations: they argue that fairness benchmarks like StereoSet inevitably encode a specific set of societal norms and stereotypes reflective of their place and time of creation, which in this case is modern, English-speaking cultures. The associations it labels as “stereotypical” may not be universally applicable across different global or historical contexts. Other critiques raise additional concerns about the validity of StereoSet’s data, ranging from spelling and grammar issues to the question of whether the stereotypes represented in the benchmark actually reflect harmful biases rather than innocuous associations or contextual ambiguities [50].
Therefore, while our findings demonstrate bias spillovers within the StereoSet framework, we cannot definitively claim that these exact trade-offs will generalize to all real-world applications. Future work is essential to validate these trade-offs in real-world applications and across more culturally-aware benchmarks. First, we will expand to newer social bias benchmarks such as BBQ [27], which highlights attested biases against people belonging to protected classes along nine social dimensions. Furthermore, complementary to bias benchmarks like CrowS-Pairs and StereoSet, RealToxicityPrompts [51] targets generative toxicity, providing prompts and scoring methods to quantify how frequently language models produce toxic continuations in realistic settings: it will be well worth exploring whether other forms of alignment, e.g., harm mitigation, could lead to unintended exacerbation of other harm dimensions.

7. Conclusions

Our study systematically investigated the cross-category effects of targeted bias mitigation techniques in LLMs, providing both a multi-dimensional audit framework and empirical evidence for the systemic trade-offs inherent in debiasing interventions. By applying four mitigation methods (Logit Steering, Activation Patching, BiasEdit, and Prompt Debiasing) across ten transformer-based LLMs and evaluating their impact on four bias dimensions (gender, profession, religion, and race) using the StereoSet benchmark, we offer a methodologically grounded assessment of both intended and unintended effects of debiasing.

7.1. Key Empirical Findings

Our results demonstrate that targeted debiasing frequently causes collateral harm, often worsening overall model utility:
  • Targeted interventions improved on-target ICAT scores in only 20.6% of cases, while causing statistically significant spillover harm in 31.5% of evaluations, representing a rate more than 1.5× higher than on-target success.
  • Many dimensions were susceptible to spillover with debiasing resulting in untargeted reductions in SS as large as 13.35% and increases in SS as large as 23.11%.
  • Smaller models (≤2B parameters) experienced larger coherence losses (LMS drops) than larger models, highlighting how capacity constraints can exacerbate fairness–accuracy trade-offs.
  • BiasEdit, while most effective at reducing on-target SS (72.5% of runs), exhibited the highest variance across dimensions and models, showing that parameter-editing methods are sensitive to architectural and distributional factors.
These findings provide strong empirical support for our hypothesis in the spirit of the No Free Lunch principle: mitigating bias along one dimension often comes with trade-offs that negatively affect other dimensions and overall model performance.

7.2. Methodological Contributions

  • Our multi-dimensional audit framework extends evaluation beyond single-target metrics, systematically quantifying cross-dimensional spillovers that would otherwise remain hidden and have not been analyzed in previous works.
  • The framework can be adopted or extended during the development of debiasing techniques to better assess effectiveness before deployment while integrating both fairness and coherence metrics.

7.3. Scope Limitations

We recognize a few limitations that constrain generalization of our findings:
  • Benchmark specificity: Results are based solely on StereoSet, reflecting Western cultural norms, dimensional imbalances, and known issues with stereotype validity.
  • Techniques tested: Our experiments cover four techniques ranging from parameter editing to inference-time interventions, but further techniques, including full fine-tuning, should be investigated as well.
  • Model coverage: Ten models from seven families (1–7B parameters) were tested; behavior in larger models (>70B), different architectures, or non-English models remains unexplored.
  • Bias operationalization: Metrics capture distributional biases in sentence completion but do not cover allocation harm, representational harm, or downstream task disparities.

7.4. Practical and Research Implications

We present the following suggestions for practitioners in the future of LLM development:
  • Evaluate debiasing interventions across all relevant bias dimensions, not just the target.
  • Monitor linguistic coherence alongside fairness metrics to avoid overcorrection and catastrophic forgetting, and to observe the overall systemic effects of debiasing beyond simply removing bias.
  • Recognize that techniques effective on larger models may cause disproportionate harm in smaller models.
For researchers investigating bias in LLMs, concrete directions derived from our findings include:
  • Multi-dimensional mitigation: Develop methods that account for correlations between dimensions, e.g., joint optimization across fairness objectives, sequential debiasing ordered by dimensional independence, or constrained editing that preserves non-target representations.
  • Mechanistic understanding: Identify which transformer components encode cross-dimensional associations and develop interventions targeting only responsible components, building on architectural bias tracing and interpretability tools.
  • Benchmark improvement: Construct evaluation benchmarks with balanced representation, cultural diversity, and intersectional examples to robustly test mitigation strategies.
Ultimately, our study reinforces that every debiasing intervention comes with trade-offs, and that there is No Free Lunch in bias mitigation. Someday, these insights may guide the development of LLMs and, more broadly, AI systems that balance trade-offs thoughtfully to become fairer, more reliable, and more reflective of the communities they serve.

Author Contributions

Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources: S.C., F.B. and E.F.; Software, Data curation: S.C. and F.B.; Writing—original draft: S.C., F.B. and E.F.; Writing—review & editing, Visualization: S.C., F.B. and E.F.; Supervision, Project administration, Funding acquisition: E.F. All authors have read and agreed to the published version of the manuscript.

Funding

This project was in part supported by the NSF (Award Number 2331722).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

We would like to thank the members of the HUMANS Lab at USC for their feedback and support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Weidinger, L.; Uesato, J.; Rauh, M.; Griffin, C.; Huang, P.S.; Mellor, J.; Glaese, A.; Cheng, M.; Balle, B.; Kasirzadeh, A.; et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 214–229. [Google Scholar]
  2. Gallegos, I.O.; Rossi, R.A.; Barrow, J.; Tanjim, M.M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; Ahmed, N.K. Bias and fairness in large language models: A survey. Comput. Linguist. 2024, 50, 1097–1179. [Google Scholar] [CrossRef]
  3. Blodgett, S.L.; Barocas, S.; Daume, H., III; Wallach, H. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5454–5476. [Google Scholar]
  4. Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and social risks of harm from language models. arXiv 2021, arXiv:2112.04359. [Google Scholar] [CrossRef]
  5. Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. First Monday 2023, 28, 13346. [Google Scholar] [CrossRef]
  6. Bolukbasi, T.; Chang, K.W.; Zou, J.Y.; Saligrama, V.; Kalai, A.T. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Proceedings of the Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 4349–4357. [Google Scholar]
  7. Bordia, S.; Bowman, S.R. Identifying and Reducing Gender Bias in Word-Level Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Minneapolis, MN, USA, 3–5 June 2019; pp. 7–15. [Google Scholar] [CrossRef]
  8. Lauscher, A.; Lueken, T.; Glavaš, G. Sustainable Modular Debiasing of Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 4782–4797. [Google Scholar] [CrossRef]
  9. Lin, Z.; Guan, S.; Zhang, W.; Zhang, H.; Li, Y.; Zhang, H. Towards trustworthy LLMs: A review on debiasing and dehallucinating in large language models. Artif. Intell. Rev. 2024, 57, 243. [Google Scholar] [CrossRef]
  10. Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, Atlanta, GA, USA, 29–31 January 2019; pp. 59–68. [Google Scholar] [CrossRef]
  11. Ferrara, E. The Butterfly Effect in artificial intelligence systems: Implications for AI bias and fairness. Mach. Learn. Appl. 2024, 15, 100525. [Google Scholar] [CrossRef]
  12. Wolpert, D.; Macready, W. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82. [Google Scholar] [CrossRef]
  13. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017; pp. 43:1–43:23. [Google Scholar]
  14. Kearns, M.; Neel, S.; Roth, A.; Wu, Z.S. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv 2018, arXiv:1711.05144. [Google Scholar] [CrossRef]
  15. Crenshaw, K. Mapping the Margins: Intersectionality, Identity Politics, and Violence against Women of Color. Stanf. Law Rev. 1991, 43, 1241–1299. [Google Scholar] [CrossRef]
  16. Guo, W.; Caliskan, A. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, Virtual, 19–21 May 2021; pp. 122–133. [Google Scholar] [CrossRef]
  17. Ma, W.; Chiang, B.; Wu, T.; Wang, L.; Vosoughi, S. Intersectional Stereotypes in Large Language Models: Dataset and Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Singapore, 2023; pp. 8589–8597. [Google Scholar] [CrossRef]
  18. Souani, B.; Soremekun, E.; Papadakis, M.; Yokoyama, S.; Chattopadhyay, S.; Traon, Y.L. HInter: Exposing Hidden Intersectional Bias in Large Language Models. arXiv 2025, arXiv:2503.11962. [Google Scholar] [CrossRef]
  19. Wang, Y.; Wang, X.; Beutel, A.; Prost, F.; Chen, J.; Chi, E.H. Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, Virtual, 14–18 August 2021; pp. 1748–1757. [Google Scholar] [CrossRef]
  20. Lu, H.; Isonuma, M.; Mori, J.; Sakata, I. Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation. arXiv 2024, arXiv:2407.16951. [Google Scholar] [CrossRef]
  21. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 2021, 54, 115. [Google Scholar] [CrossRef]
  22. Pessach, D.; Shmueli, E. A Review on Fairness in Machine Learning. ACM Comput. Surv. 2022, 55, 51. [Google Scholar] [CrossRef]
  23. Ferrara, E. Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. Sci 2024, 6, 3. [Google Scholar]
  24. Nangia, N.; Vania, C.; Bhalerao, R.; Bowman, S.R. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1953–1967. [Google Scholar] [CrossRef]
  25. Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; Chang, K.W. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 15–20. [Google Scholar] [CrossRef]
  26. Dhamala, J.; Sun, T.; Kumar, V.; Krishna, S.; Pruksachatkun, Y.; Chang, K.W.; Gupta, R. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, Virtual, 3–10 March 2021; pp. 862–872. [Google Scholar] [CrossRef]
  27. Parrish, A.; Chen, A.; Nangia, N.; Padmakumar, V.; Phang, J.; Thompson, J.; Htut, P.M.; Bowman, S. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2086–2105. [Google Scholar]
28. Wang, S.; Cao, X.; Zhang, J.; Yuan, Z.; Shan, S.; Chen, X.; Gao, W. VLBiasBench: A comprehensive benchmark for evaluating bias in large vision-language model. arXiv 2024, arXiv:2406.14194. [Google Scholar]
  29. Nadeem, M.; Bethke, A.; Reddy, S. StereoSet: Measuring Stereotypical Bias in Pre-trained Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 5356–5371. [Google Scholar] [CrossRef]
  30. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  31. Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A Survey of Machine Unlearning. ACM Trans. Intell. Syst. Technol. 2025, 16, 108. [Google Scholar] [CrossRef]
  32. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  33. Halevy, K.; Sotnikova, A.; AlKhamissi, B.; Montariol, S.; Bosselut, A. “Flex Tape Can’t Fix That”: Bias and Misinformation in Edited Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8690–8707. [Google Scholar] [CrossRef]
  34. Zmigrod, R.; Mielke, S.; Wallach, H.; Cotterell, R. Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 1651–1661. [Google Scholar] [CrossRef]
  35. Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
  36. Leteno, T.; Gourru, A.; Laclau, C.; Gravier, C. An investigation of structures responsible for gender bias in BERT and DistilBERT. In Advances in Intelligent Data Analysis XXI, Proceedings of the International Symposium on Intelligent Data Analysis; Springer: Cham, Switzerland, 2023; pp. 249–261. [Google Scholar]
  37. Yan, S.; Kao, H.T.; Ferrara, E. Fair class balancing: Enhancing model fairness without observing sensitive attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 16–23 October 2020; pp. 1715–1724. [Google Scholar]
  38. Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and Editing Factual Associations in GPT. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 17359–17372. [Google Scholar]
  39. Meng, K.; Sharma, A.S.; Andonian, A.J.; Belinkov, Y.; Bau, D. Mass-Editing Memory in a Transformer. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  40. Xu, X.; Xu, W.; Zhang, N.; McAuley, J. BiasEdit: Debiasing Stereotyped Language Models via Model Editing. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), Albuquerque, NM, USA, 3 May 2025; pp. 166–184. [Google Scholar] [CrossRef]
  41. Gonen, H.; Goldberg, Y. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 609–614. [Google Scholar] [CrossRef]
  42. Zhang, F.; Nanda, N. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
43. Schick, T.; Udupa, S.; Schütze, H. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Trans. Assoc. Comput. Linguist. 2021, 9, 1408–1424. [Google Scholar]
  44. Caliskan, A.; Bryson, J.J.; Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 2017, 356, 183–186. [Google Scholar] [CrossRef]
  45. Kirk, H.R.; Jun, Y.; Volpin, F.; Iqbal, H.; Benussi, E.; Dreyer, F.; Shtedritski, A.; Asano, Y. Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Adv. Neural Inf. Process. Syst. 2021, 34, 2611–2624. [Google Scholar]
  46. Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; et al. Toy Models of Superposition. arXiv 2022, arXiv:2209.10652. [Google Scholar] [CrossRef]
  47. Amirizaniani, M.; Martin, E.; Roosta, T.; Chadha, A.; Shah, C. AuditLLM: A tool for auditing large language models using multiprobe approach. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 5174–5179. [Google Scholar]
  48. Qiu, P.; Zhou, S.; Ferrara, E. Information suppression in large language models: Auditing, quantifying, and characterizing censorship in DeepSeek. Inf. Sci. 2026, 724, 122702. [Google Scholar] [CrossRef]
  49. Blodgett, S.L.; Lopez, G.; Olteanu, A.; Sim, R.; Wallach, H. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 1004–1015. [Google Scholar] [CrossRef]
  50. Govil, P.; Jain, H.; Bonagiri, V.; Chadha, A.; Kumaraguru, P.; Gaur, M.; Dey, S. COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models. In Proceedings of the 17th ACM Web Science Conference 2025, Websci ’25, New Brunswick, NJ, USA, 20–24 May 2025; pp. 460–471. [Google Scholar] [CrossRef]
  51. Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3356–3369. [Google Scholar]
Figure 1. A Visual Representation of Our Auditing Framework and the “No Free Lunch” Principle. The process begins with a pre-trained LLM with entangled biases. A debiasing technique is applied to a single target dimension. The debiased model is then evaluated across all dimensions using the StereoSet benchmark.
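The auditing procedure summarized in Figure 1 can be expressed as a simple nested loop. The sketch below is illustrative only: apply_debiasing and evaluate_stereoset are hypothetical helpers standing in for the mitigation techniques and the StereoSet evaluation, while the dimension and technique names follow the paper.

DIMENSIONS = ["race", "religion", "profession", "gender"]
TECHNIQUES = ["logit_steering", "activation_patching", "bias_edit", "prompt_debiasing"]

def audit(model, apply_debiasing, evaluate_stereoset):
    """Apply each technique to each target dimension, then evaluate on all dimensions."""
    results = {}
    for technique in TECHNIQUES:
        for target in DIMENSIONS:
            debiased = apply_debiasing(model, technique, target)
            # Evaluating every dimension, not just the targeted one, is what
            # exposes cross-dimensional spillover.
            results[(technique, target)] = {
                dim: evaluate_stereoset(debiased, dim) for dim in DIMENSIONS
            }
    return results

With 4 techniques, 4 target dimensions, and 10 models, this loop yields the 160 experiments and 640 per-dimension evaluations reported in the abstract.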
Figure 2. Average Impact on Overall Score (ICAT). Each cell represents the average outcome of an intervention, where the y-axis is the dimension being targeted for mitigation and the x-axis is the dimension being evaluated. Blue cells indicate a negative average change (net harm to the model’s quality and fairness), while red cells indicate a positive change (net improvement).
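Each heatmap cell in Figure 2 is a change in the ICAT score. Following the definition in the StereoSet paper [29], ICAT combines the Language Modeling Score (LMS) with the Stereotype Score (SS); a minimal sketch:

def icat(lms: float, ss: float) -> float:
    """Idealized CAT score [29]: LMS scaled by how close SS is to the neutral value of 50."""
    return lms * min(ss, 100.0 - ss) / 50.0

# e.g., icat(90.0, 50.0) == 90.0 (fully neutral), icat(90.0, 100.0) == 0.0 (fully stereotyped).
# Each cell is assumed to average icat_after - icat_before over the runs for that
# target/evaluation pair.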
Figure 3. Target Effectiveness vs. Spillover Impact (Stereotype Change). This scatter plot visualizes the outcome of every unique debiasing intervention. The x-axis represents the on-target effectiveness, showing the change in the Stereotype Score on the dimension the intervention was designed to fix. The y-axis represents the collateral impact on untargeted dimensions. Dashed lines mark zero bias change on the target and spillover dimensions, dividing the plot into four outcome quadrants.
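The four quadrants in Figure 3 can be read off mechanically from the two SS changes. A hypothetical classifier, assuming for illustration that SS starts above 50 so that a negative change means reduced stereotyping:

def quadrant(target_ss_change: float, spillover_ss_change: float) -> str:
    """Map an intervention to one of the four outcome quadrants of Figure 3."""
    if target_ss_change < 0 and spillover_ss_change < 0:
        return "on-target reduction with beneficial spillover"
    if target_ss_change < 0 and spillover_ss_change >= 0:
        return "on-target reduction with collateral bias increase"
    if target_ss_change >= 0 and spillover_ss_change < 0:
        return "no on-target gain, but spillover improvement"
    return "bias worsens on both the target and spillover dimensions"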
Figure 4. Dimension-specific debiasing spillover effects, showing cases with beneficial and adverse spillovers (reductions and increases in LMS, respectively). Both panels display the top spillovers per target-evaluation pair across all model and technique types.
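The selection behind Figure 4 amounts to picking the extreme LMS change for every off-target pair. A sketch, assuming a hypothetical long-format export results.csv with columns model, technique, target_dim, eval_dim, and lms_change (none of these names come from the paper):

import pandas as pd

df = pd.read_csv("results.csv")                           # hypothetical results export
spill = df[df["target_dim"] != df["eval_dim"]]            # keep off-target evaluations only
grouped = spill.groupby(["target_dim", "eval_dim"])["lms_change"]
largest_gain = spill.loc[grouped.idxmax()]                # top LMS increase per pair
largest_drop = spill.loc[grouped.idxmin()]                # top LMS decrease per pair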
Figure 5. Change in bias, quantified as SS_diff and averaged across all models. Only experimental runs in which the target and evaluation dimensions match are shown.
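SS_diff is not spelled out in the caption itself; the sketch below assumes the straightforward reading, the post-intervention Stereotype Score minus the pre-intervention one on the same dimension, and should be treated as an illustration rather than the authors' exact definition.

def ss_diff(ss_before: float, ss_after: float) -> float:
    """Assumed definition: change in Stereotype Score caused by the intervention."""
    return ss_after - ss_before

# Under this reading, negative values indicate reduced stereotyping whenever ss_before > 50.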
Figure 6. Change in SS and LMS metrics by model type. Averages are computed over all technique and dimension types.
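The per-model averages in Figure 6 are a plain group-by over model type. A sketch under the same hypothetical results.csv layout as above, here with an additional ss_change column per run:

import pandas as pd

df = pd.read_csv("results.csv")                           # hypothetical results export
per_model = df.groupby("model")[["ss_change", "lms_change"]].mean()
print(per_model.sort_values("lms_change"))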
Table 1. Models Used in Experiments.
Family      Model                                 Parameters
Gemma       google/gemma-2b                       2B
Gemma       google/gemma-7b                       7B
OLMo        allenai/OLMo-1B-0724-hf               1B
OLMo        allenai/OLMo-2-1124-7B                7B
LLaMA       meta-llama/Llama-3.2-1B               1B
LLaMA       meta-llama/Llama-2-7b-hf              7B
Qwen        Qwen/Qwen2.5-3B-Instruct              3B
GPT-Neo     EleutherAI/gpt-neo-1.3B               1.3B
Mistral     mistralai/Mistral-7B-Instruct-v0.3    7B
Deepseek    deepseek-ai/deepseek-llm-7b-chat      7B
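The model identifiers in Table 1 are Hugging Face Hub names, so any of them can be loaded with the transformers library. A minimal sketch (note that the Gemma and Llama checkpoints are gated and require accepting their licenses and authenticating with a Hugging Face token):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-1.3B"   # any identifier from Table 1
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)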
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
