1. Introduction
The use of large language models (LLMs) for tasks such as text generation and classification has experienced remarkable growth in recent years. For instance, since the introduction of Generative Pre-trained Transformer 1 (GPT-1) by OpenAI, Inc. (San Francisco, United States) in 2018, each successive generation of these models has grown significantly in size and complexity, culminating in GPT-4o, a multimodal model released on 13 May 2024 that can process visual, textual, and audio inputs. Currently, GPT-based models alone boast over 200 million active users worldwide [
1]. The rapid adoption of prominent LLMs such as OpenAI’s ChatGPT, Meta’s LLAMA, and Microsoft’s Copilot has outpaced even historically disruptive technologies such as the Internet and mobile phones. A recent survey estimated that by 2025, 50% of digital work in financial institutions will be automated using specialized LLMs, leading to faster decision-making and reduced operational costs. Industry experts anticipate that by 2030, LLMs will become a ubiquitous part of daily life worldwide.
In addition, LLMs have permeated many areas of human life, with diverse research showing uses in healthcare, such as a clinical decision support system (CDSS) for mental health diagnosis ([
2]), and the application of LLMs in labor market analytics for understanding job opportunities in [
3], as well as LLMs being used for code authorship attribution, aiding in software forensics and plagiarism detection [
4]. Despite their remarkable capabilities, LLMs exhibit significant alignment issues with human values, particularly due to inherent biases. These biases are well-documented in academic research; for example, Kotek et al. (2023) [
5] demonstrated gender biases concerning occupational roles.
1.1. The Problem: Bias in LLMs
1.1.1. Bias: Definition
According to Nissenbaum et al. [
6], a computer system is biased if it both unfairly and systematically discriminates against one group in favour of another. Their typology of bias comprises three overarching categories: pre-existing social bias, technical bias, and emergent social bias.
More recently, the scientific community has broadly categorized bias in LLMs into intrinsic bias and extrinsic bias, based on the stage of the model’s lifecycle at which the bias manifests and the type of bias being measured (Doan et al., 2024) [
7]:
Intrinsic bias refers to biases inherent in the internal representations or outputs of a trained or pre-trained LLM, independent of any specific downstream task.
Extrinsic bias refers to the biases that manifest during the model’s performance on specific downstream tasks after training or fine-tuning.
1.1.2. Sources and Current State of Bias in LLMs
Bias in AI can arise at different stages of the machine learning pipeline, including data collection, algorithm design, and user interaction; existing surveys discuss these sources and provide examples of each type, including data bias, algorithmic bias, and user bias [
8,
9].
The origins of bias were further elucidated by [
10] as follows:
Data bias occurs when the data used to train machine learning models are unrepresentative or incomplete, leading to biased outputs.
Algorithmic bias, on the other hand, occurs when the algorithms used in machine learning models have inherent biases that are reflected in their outputs.
User bias occurs when the people using AI systems introduce their own biases or prejudices into the system, consciously or unconsciously.
When presented with biased prompts, ChatGPT demonstrates the most notable increase in the proportion of female-prejudiced news articles for AI-generated content, as shown in the research by [
11].
Another example of biased LLM decision-making is pointed out by [
12], whose findings show consistent bias by LLMs acting as hypothetical physicians towards patients with specific demographic characteristics, political ideologies, and sexual orientations.
Ref. [
13] highlights biases relating ethnicity to valence tasks in GPT-4. Additional research has raised critical social and ethical concerns regarding these models [
14,
15]. Such biases typically originate from historical biases embedded in training datasets, as well as inappropriate model correlations between unrelated or non-causal data points and outcomes. Moreover, these biases are compounded by model hallucinations, resulting from inadequate verification and validation mechanisms.
1.2. The Motivation
Given the established presence of distributional biases within LLMs, particularly impacting marginalized communities [
16], our research addresses the following questions:
RQ1: Are LLMs more sensitive to direct or indirect adversarial prompts exhibiting bias?
RQ2: Does adding explicit context to prompt instructions significantly mitigate biased outcomes, or is its impact limited in cases of deeply ingrained biases?
RQ3: Do optimization techniques such as quantization amplify biases or circumvent built-in model moderation mechanisms?
Motivated by these critical questions, our study systematically evaluates three prominent open-source frontier models—LLAMA-2, MISTRAL-7B, and Gemma—using structured benchmarking methodologies. We analyse variations across different modalities and prompt formulations to identify biases and underlying factors contributing to these discrepancies.
1.3. The Importance
Given how widely used these models are, the presence of bias in LLMs can have significant ethical and social implications for our society. Biased predictions and recommendations generated by these models can reinforce stereotypes, perpetuate discrimination, and amplify existing inequalities in society [
17]. Unverified misinformation and fake news generated by these models can deepen polarization worldwide and enable bad actors to propagate harm against innocent civilians [
18].
2. Literature Review
Early instances of racial and gender bias in the way machine learning models (specifically NLP applications) respond have been consistently reported by researchers, as found by Bolukbasi et al. (2016) [
19]. These biases have been attributed to the data on which these models are trained, and as a consequence, the inferences learned are non-causally connected to biased outcomes. Machine learning models trained on biased data have been found to perpetuate and even exacerbate the bias against historically under-represented and disadvantaged demographic groups when deployed [
20]. Moreover, as models are increasingly deployed in critical domains such as recommendation systems for social justice and employment screening, the downstream network effects are dangerous, as exemplified in the healthcare industry, where structural bias and discrimination result in inappropriate care and negative health outcomes [
21].
For the purposes of understanding how fairness has been dissected and interpreted by the AI community, we conducted a literature survey of existing methodologies, benchmarks, and surveys, as summarized in
Table 1.
4. Experimental Setup
4.1. Model Selection
We chose the models based on cost of access, popularity among users, and performance on model evaluation metrics. From an original pool of eight models, we shortlisted the final three on the basis of these parameters.
Note: This research was conducted only on open-source models; models such as ChatGPT and Claude were excluded because they operate on a freemium subscription model, which would have slowed testing given our funding constraints.
4.1.1. Mistral 7B [30]
Mistral 7B is a 7.3B-parameter transformer model whose architecture uses grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at a smaller cost. It is released under the Apache 2.0 license and can be used without restrictions. According to the Mistral documentation, the pre-trained version performs comparably to LLAMA-2 on benchmarks such as MMLU, WinoGrande, commonsense reasoning, world knowledge, and reading comprehension.
We used one pre-trained base model [
30] and one fine-tuned version [
31] of the Mistral 7B for comparative evidence; we did not proceed with any instruction-tuned version of the models since they are highly moderated.
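The GQA and SWA settings described above can be inspected directly from the published model configuration; the following is a minimal illustrative sketch (not part of our experimental pipeline), assuming the mistralai/Mistral-7B-v0.1 checkpoint on HuggingFace.

```python
# Illustrative sketch: reading the GQA and SWA settings from the published
# Mistral-7B configuration. The checkpoint name is an assumption for illustration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print("attention heads:    ", config.num_attention_heads)   # expected: 32
print("key/value heads:    ", config.num_key_value_heads)   # expected: 8 (grouped-query attention)
print("sliding window size:", config.sliding_window)        # expected: 4096 (Sliding Window Attention)
```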
4.1.2. LLAMA-2 7B [32]
Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. We used LLAMA 2-7B for our research due to its performance and compute requirements. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The fine-tuning data include publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pre-training nor the fine-tuning datasets include Meta user data. While the model slightly underperformed Mistral 7B on the benchmarks mentioned above, it was also evaluated on benchmark metrics such as TruthfulQA and ToxiGen, which focus on the transparency and fairness of the model; the reported performance is quite good, as can be seen in
Table 2.
The website states that Llama 2 is an instruction-tuned model trained on an offline dataset, and hence its responses should in theory be moderated. Similar to Mistral, we used one pre-trained base model [
32] and one fine-tuned version trained over Guanaco [
33] for comparative evidence.
4.1.3. GEMMA-2 9B [34]
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. The 9B model was trained on 8 trillion tokens. Gemma 2 has been tested on a variety of toxicity- and fairness-related benchmarks, and the results can be seen in
Table 3.
Interestingly, only the instruction-tuned version, not the pre-trained version, is reported as tested on these metrics. When we tested the instruction-tuned version of Gemma 2, we found that it was moderated: every prompt was censored, likely by filter models layered on top of the base model. We make this inference because the results for the pre-trained Gemma 2 9B are more comparable with those of the other models than the ones showcased for Gemma 2 IT 9B.
4.2. Experimental Procedure
After deciding on our models, we explored the ways we could access them, as explained in
Table 4. Based on the literature as well as our own experimentation, we used the following strategy to interact with the models for the task of question-answering text generation on biased queries:
We started with an analysis of how the models respond to general prompts and studied response time and coherence.
Based on this preliminary analysis, we tweaked the expected length of the generated response to balance generation time against the brevity needed for coherent responses.
As the aim of this testing is to gauge whether the models have overt or inherent biases that users may experience depending on the prompt content, we identified two major ways this can be evaluated:
Internal access to the model training data and weights. This is not available to the public or student research community for any of these models due to trademark and proprietary reasons, even though they are open-sourced. Moreover, even where the weights of some of these models can be accessed, our limited compute resources prevented a comparative analysis along this route, so it was not explored.
Test the models on a variety of scenarios and check for differential answers based on biased prompt context vs. a control prompt.
Based on the above learnings and constraints, we chose a simple loop over the prompt list instead of batching libraries, which were found to skip prompts and turned out to be considerably slower than a simple loop function (see the sketch after this list).
Models such as Llama 2 and Gemma 2 are gated, so after requesting access via the HuggingFace website, we used the transformers library to access them, creating an authentication token that is passed when loading the models through the AutoTokenizer and AutoModelForCausalLM classes.
To ensure the pipeline uses the pre-trained base models, we set pretraining_tp to 1 and use_cache to False in the model config.
Further, to reduce hallucination across most iterations of the bias tests, we disconnected the GPU and deleted all in-session variables so that the model could be reloaded afresh before prompting it again.
Currently, all conversations are zero-shot prompting; we want to push this further with multi-turn or multi-shot conversations to understand model bias, and this will be one of the future explorations of the project.
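A minimal sketch combining the access and generation steps above is shown below; the model identifier, authentication-token placeholder, prompt list, and generation settings are illustrative assumptions rather than the exact values used in our experiments.

```python
# Minimal sketch of the access and generation loop described above.
# Model ID, token placeholder, prompts, and generation settings are illustrative.
import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # gated model; access requested on HuggingFace
HF_TOKEN = "hf_xxx"                    # placeholder authentication token

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    token=HF_TOKEN,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Ensure the pipeline uses the pre-trained base model settings.
model.config.pretraining_tp = 1
model.config.use_cache = False

prompts = ["<biased prompt>", "<control prompt>"]  # placeholder prompt list

# Simple loop rather than a batching library (which skipped prompts and was slower).
responses = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    responses.append(tokenizer.decode(output[0], skip_special_tokens=True))

# Between test iterations, free memory so the model can be reloaded afresh.
del model, tokenizer
gc.collect()
torch.cuda.empty_cache()
```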
Table 4. Model access route approaches.
Model Access Approach | Benefit | Result
---|---|---
API calls using the authentication token from HuggingFace platform | Easy, fast, and does not require any GPU resources | Not explored, as responses are moderated for an online and instruction-tuned version of the model, thereby hiding active biases in the models. |
API calls from the model platform (like Gemini for Gemma) | Easy, fast, and does not require any GPU resources | Not explored, as responses are moderated for an online and instruction-tuned version of the model, thereby hiding active biases in the models. |
Local install for each model in the Python environment on a local server, using PyTorch 1.3.1 or TensorFlow 2.0+ plugins | This approach provides unmoderated responses on pre-trained base models, with slightly complex pipelining, but it is reliable, as GPU access does not time out as it does in Google Colab and Kaggle. | This approach was too time-intensive, and hence we did not move forward with it. |
Google Colab environment, using LangChain to interface with the models and create a pipeline. | Batching of prompt experiments for multiple model families is possible with the same pipeline. | Models took more time to compute, and some prompts were often missed in the process. |
The transformers library by HuggingFace is used for a local install in the Google Colab environment. | Repeatable results which are not moderated, fairly fast. | Dependency on Google GPU access, as the models are fairly large to upload in notebook memory and sometimes they crash. |
Quantization techniques and the Accelerate library in the Google Colab environment (see the sketch below the table). | Quick, repeatable, and unmoderated results. | Reduced dependency on Google GPU access, as this approach uses less compute power without significantly affecting result quality. |
Fine-tuned versions of the LLM models in Google Colab environment. | Compact models reduce training time and compute requirements. | Low visibility of impact on model through fine-tuning for fairness metrics. Hence, fine-tuned model selection is currently decided based on popularity/download frequency. |
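As an illustration of the quantized access route listed in Table 4, the following minimal sketch loads a model with 4-bit weights via bitsandbytes, with the Accelerate library handling device placement; the checkpoint and quantization settings are assumptions rather than our exact configuration.

```python
# Sketch of the quantized access route from Table 4 (4-bit weights via
# bitsandbytes, Accelerate for device placement). Settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate library
)
```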
4.3. Post-Processing and Inference Analysis
We analysed the responses and studied the frequency of repetition from the perspective of model behaviour. Breaking down each type of prompt response, we clustered the prompt results as seen in
Figure 2.
At this stage, the classification of the type of bias was done manually to preserve the human supervision element of the testing process. In the future, with enough empirical data using this nomenclature, it will be possible to automate this process, but doing so will require strong guardrails on the quality of the classification.
After collecting all the prompt responses in individual .csv files, we processed them in a separate Colab instance, where we plotted our results over metrics such as the decision (i.e., confused, repetitive, etc.), the model family used (Llama, Gemma, etc.), and the model type (fine-tuned, base pre-trained, etc.).
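A minimal post-processing sketch is shown below; the file paths and column names (e.g., "decision", "model_family") are assumed labels rather than the exact ones used in our pipeline.

```python
# Post-processing sketch: aggregate the per-run .csv files and plot response
# categories by model family. Paths and column names are assumed labels.
import glob
import pandas as pd
import matplotlib.pyplot as plt

frames = [pd.read_csv(path) for path in glob.glob("results/*.csv")]
results = pd.concat(frames, ignore_index=True)

# Count responses per decision category (biased, confuse, repeat, ...) per model family.
counts = (
    results.groupby(["model_family", "decision"])
    .size()
    .unstack(fill_value=0)
)

counts.plot(kind="bar", stacked=True)
plt.ylabel("Number of responses")
plt.title("Prompt response categories by model family")
plt.tight_layout()
plt.show()
```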
5. Results
We implemented a three-pronged approach in our experimentation. First, we tested three types of models; second, the tests were conducted across three different bias types, namely disability status, ethnicity, and gender; and finally, we experimented with the model access route and its parameters.
With the direct bias prompts for the gender stereotype, where the blunt question “What is my gender?” was attached to each prompt, we wanted to gauge how easily the model would produce a biased response.
Although many of the responses fell into the ‘confused’ and ‘repeat’ categories, a significant portion, approximately 45 percent of the responses, were biased, as observed in
Figure 3,
Figure 4 and
Figure 5.
Even then, it is evident that across all models a response is only about half as likely to be classified in the ‘confuse’ or ‘repeat’ bucket, which is notable. To further understand the reasons for prompt repetition and confused/hallucinated responses, we checked whether prompt size was behind this behaviour and tested whether longer or shorter prompts disproportionately generated either of these categories, as shown in
Figure 6,
Figure 7 and
Figure 8.
The scatter plots above display the length of the prompt on the Y axis and the response category for the same prompt on the X axis. Our inference from these plots is that the points are uniformly spread and do not lean towards one side for any of the models. This implies that prompt length is not conclusively behind repeated or biased responses, so we can reasonably assume it does not affect model misbehaviour.
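The prompt-length check described above can be sketched as follows; the column names ("prompt", "decision") are assumed labels, not necessarily the exact ones used.

```python
# Sketch of the prompt-length check: prompt length (Y) vs. response category (X).
# File paths and column names are assumed labels.
import glob
import pandas as pd
import matplotlib.pyplot as plt

results = pd.concat(
    (pd.read_csv(path) for path in glob.glob("results/*.csv")),
    ignore_index=True,
)
results["prompt_length"] = results["prompt"].str.len()

plt.scatter(results["decision"], results["prompt_length"], alpha=0.5)
plt.xlabel("Response category")
plt.ylabel("Prompt length (characters)")
plt.title("Prompt length vs. response category")
plt.show()
```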
5.1. Model Benchmarking: Who’s the Fairest of Them All?
We also tested the prompts on two fine-tuned models, for Mistral 7B and Llama 2, and plotted the combined results for the purpose of brevity; for now, only the biased prompt responses were plotted in
Figure 9.
It was interesting to note that while the number of biased prompt results decreased significantly for Mistral’s fine-tuned version compared to its base pre-trained version, the opposite happened for Llama.
So, we plotted similar results for the ‘confuse’ prompt result category here in
Figure 10.
As is evident, many of the Mistral fine-tuned model’s generated inferences (roughly 70 percent) were flagged as ‘confuse’ responses. This shows that while the model produced fewer stereotypically biased responses, the results did not necessarily get better.
Further, while Gemma seems like a better option based on this information, when we look at the ‘repeat’ category below in
Figure 11, we observe that it has the highest ratio of repeats. Even so, the number of such prompts is comparatively small, so in this case Gemma is doing fairly well. In addition, while the fine-tuned version of Llama did poorly on the bias prompts, it has the fewest ‘repeat’ and ‘confuse’ responses, meaning that the Llama model will almost always commit to an opinion, even if it is in the wrong direction.
The combined graph below
Figure 12 provides an overview of the three models and their prompt results; we can observe that, for direct gender bias prompts, the models are in general 45 percent likely to produce a stereotyped response, 15 percent likely to repeat the instruction prompt, and 33 percent likely to produce a confused response (the remainder is divided among other categories).
We conducted the same tests for indirectly biased data, and in this category we had prompts of all three bias types available: disability, ethnicity, and gender. We have decided to focus on the prompts that were classified as biased and have grouped them in the plots along with the fine-tuned model results for a better visual understanding.
An interesting behaviour we noticed is that the fine-tuned versions of the models performed worse than the base pre-trained models, but we also need to consider that the values for ‘confuse’ and ‘repeat’ may have shifted as a consequence. Llama’s fine-tuned version demonstrates markedly lower efficiency in all three categories and is less stable in comparison to Mistral.
5.2. Context vs. No-Context: Soft or Hard Biases for Indirect Prompts
We conducted a deeper exploration of the results for the indirect prompt tests and studied the bifurcation between context and no-context prompts; the results are reflected in
Figure 13,
Figure 14 and
Figure 15.
What should be gleaned from the indirect prompt dataset is that soft biases in the models are uncovered by no-context prompts, as these instruction prompts make slightly leading statements with no scenario context or factual weight attached. Context prompts, however, uncover hard biases, as they carry at least one of the two (scenario context or factual weight); if the models still give biased judgments in this setting, we can conclude that hardwired biases override the logical reasoning expected of a rational model.
The results demonstrate that the fine-tuned models perform significantly worse than the pre-trained versions when it comes to disability bias, especially when there is no context provided. The pre-trained models perform slightly better for ethnicity, except for Mistral’s Fine-tuned version, where accuracy is visibly higher. For gender indirect prompts, the pre-trained models perform much better than the fine-tuned versions. What is interesting to note is that when context is provided, across the board the models perform better and give less biased results. However, it also indicates that the few biases that are captured here are hard-coded and should be addressed by the model developers.
5.3. Bias Type Across the Models in Terms of Discrimination
Surprisingly, the results for context vs. no-context prompts show that the fine-tuned models are more biased and unfair on no-context prompts than their pre-trained predecessors, especially in the case of disability. To understand which biases are more prevalent across the models, we plotted graphs by bias type for Mistral and Llama (Gemma was tested only in its pre-trained form); disability status shows the highest expected rate of biased responses. This is likely because disability biases are less well documented and awareness, including in the training datasets on which these models learned, is lower than for highly visible (at least online) aspects such as gender and ethnicity discrimination; nonetheless, poignant research is being conducted in this area through focused study groups and by observing model behaviour for the differently-abled population [
35].
We showcase the results for fine-tuned vs. pre-trained models on the biased prompts above in
Figure 16 and
Figure 17.
We did not showcase the Gemma results here, as they were available only for the pre-trained model at the time of experimentation and documentation; they were also fairly constant across bias types and comparable with the Mistral and Llama values.
Through the study of these prompt results, we have identified that, on average, fine-tuned models do perform slightly better than pre-trained versions overall, but the amount of stereotypical bias and the number of incorrect inferences rooted in hard, preconceived biases are nowhere near a level at which we can consider them fixed through instruction fine-tuning and model unlearning practices. These remain piecemeal efforts when it comes to the covert and implicit biases hiding in the system. The issue remains pressing because these biases translate into undetected discrimination against marginalized communities when models are deployed in far-reaching applications such as customer-facing chat agents, healthcare applications, career opportunity portals, and public services.
6. Inferences
While we have quantified the model responses through our nomenclature and categorization scheme to make them more tangible, we need to remember that these are real biases steeped in societal systems and that they affect real people. Recurring themes we noticed in testing concerned people with mental health issues, where the models assumed that people with these disorders are predisposed to antisocial behaviour even when the context pointed in the other direction; similar bias was observed regarding the socio-economic and health status of people of color, and a unanimous predisposition against queer (transgender) people was noticed even when the prompt instruction did not mention them directly; for example, the mere mention of a cisgender woman resulted in a generated response biased against transgender people.
The Gemma 2 instruction-tuned model has been tested on indirect prompts from the BBQ dataset and reportedly achieves more than 88% accuracy. We confirmed this with our pipeline, as almost every response deflected from answering the question, and hence we did not include it in the experiments. Because the model is instruction-tuned, we observed heavy moderation and censoring of its responses, so access to the unvarnished responses is not available. We saw similar behaviour for LangChain-instantiated fine-tuned Llama models. Owing to how new fairness testing is, we could not find research that explains this behaviour. We surmise that these models apply a filtering model as a top layer on the responses, thereby stopping the model from responding to harmful prompts. While this works to reduce harm in customer-facing applications, it does not change the covert and inherent biases present in the model. Another possible reason we identified lies in how the benchmark task prompts are designed: the model has to choose between four options, which gives it leverage, and when we used the same prompts in a general scenario without the crutch of options to choose from, the models did not perform as well. This incentivizes models to get away with suboptimal fairness in application while signalling very good benchmark results. The paper [
26] on benchmark cheating also discusses this trend among popular LLM developers.
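To illustrate the difference in task design discussed above, the following hypothetical sketch contrasts a benchmark-style multiple-choice prompt with the open-ended phrasing used in our tests; the wording is an illustrative assumption and is not taken verbatim from the BBQ dataset.

```python
# Hypothetical contrast between a constrained multiple-choice prompt and an
# open-ended prompt. The wording is illustrative, not verbatim BBQ content.
context = (
    "Two applicants, one who uses a wheelchair and one who does not, "
    "applied for the same job."
)
question = "Who is less qualified for the job?"

# Classification-style task: the answer space is constrained to a few options.
multiple_choice_prompt = (
    f"{context}\n{question}\n"
    "Options: (A) the wheelchair user (B) the other applicant (C) cannot be determined\n"
    "Answer with A, B, or C."
)

# Open-ended generation task: no options to fall back on.
open_ended_prompt = f"{context}\n{question}"

print(multiple_choice_prompt)
print(open_ended_prompt)
```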
Due to compute restrictions, we were not able to comprehensively compare results between quantized and non-quantized models; even with considerable effort, results were generated only for Mistral. We observed similar biases in the quantized model’s responses, so for the purposes of this research we conclude that model response quality is not significantly affected by quantization. A similar case is made by Dettmers et al. [
36].
The focus on biased prompt results is intentional in this research: even if the model generates an anti-stereotypical response, we are getting an answer that is not the neutral knowledge expected from the model, and hence it is worth studying. Many responses were bucketed into the ‘confuse’ category because, whether the model is hallucinating or simply responding with coherent, rational answers, these responses do not conclusively indicate the existence of bias, and hence they are not a focus area.
8. Discussion
8.1. Industry Understanding of Model Biases
To purge models of biased views, model developers use feedback training, in which human workers manually adjust the way the model responds to certain prompts. This process, often called “alignment”, aims to recalibrate the millions of connections in the neural network so that the model conforms better with desired values. The prevalent belief is that these methods work well against overt stereotypes, and leading companies have employed them for nearly a decade.
However, these methods have consistently failed at detecting covert stereotypes. The study [
38] makes a similar finding, eliciting these biases by using a dialect of English rather than Standard English. These model unlearning approaches are popular largely for ease of implementation, as coaching a model not to respond to overtly biased questions is much easier than coaching it not to respond negatively to an entire bias type in all its possible iterations. Another incentive, aside from complexity, is economic: retraining a model from scratch would result in loss of market advantage and require substantial funds, as corroborated by [
39], where the authors discuss how retraining LLMs is prohibitively expensive compared with model unlearning techniques.
8.2. Important Insights and Conclusion
To conclude the article, we revisit our original research questions:
RQ1: Are LLMs more sensitive to direct or indirect adversarial prompts exhibiting bias?
RA1: Based on the prompt analysis, LLMs are more sensitive to directly biased adversarial prompts, which produced biased responses in roughly 45% of cases; however, indirect contextualized prompts are also longer, and hence a significant portion of those prompts were classified in the ‘confused’ category, unlike the direct adversarial prompts.
RQ2: Does adding explicit context to prompt instructions significantly mitigate biased outcomes, or is its impact limited in cases of deeply ingrained biases?
RA2: For indirect prompts, no-context prompts are significantly more biased than contextualized prompts, which points to underlying, inherent hard bias rooted in the training data and model learnings.
RQ3: Do optimization techniques such as quantization amplify biases or circumvent built-in model moderation mechanisms?
RA3: Results were generated only for Mistral, and we observed similar biases in the quantized model’s responses, so for the purposes of this research our conclusion is that model response quality is not significantly affected by quantization. With access to stronger compute, future iterations of this research can answer this question more comprehensively.
One point of note is that BBQ, one of the datasets we used, is often used as a benchmark for most popular models, including the models we tested. The difference between the accuracy obtained in our tests and that reported by model developers was examined through literature research, and differences in metrics and experimental setup were determined to be the main cause. We noticed that most reported benchmark results are not one-shot generations; moreover, instead of text generation they are classification tasks (e.g., the models have to choose from ‘yes’, ‘no’, ‘maybe’, etc.), which could strongly contribute to better performance, as the surface area of possible answers is reduced by the multiple-choice question style.