Article

Who Is to Blame for the Bias in Visualizations, ChatGPT or DALL-E?

by Dirk H. R. Spennemann 1,2
1 School of Agricultural, Environmental and Veterinary Sciences, Charles Sturt University, Albury, NSW 2640, Australia
2 Libraries Research Group, Charles Sturt University, Wagga Wagga, NSW 2678, Australia
Submission received: 17 March 2025 / Revised: 25 April 2025 / Accepted: 28 April 2025 / Published: 29 April 2025
(This article belongs to the Special Issue AI Bias in the Media and Beyond)

Abstract

Due to a range of factors in the development stage, generative artificial intelligence (AI) models cannot be completely free from bias. Some biases are introduced by the quality of the training data and by developer influence during the design and training of large language models (LLMs), while others are introduced in the text-to-image (T2I) visualization programs. The bias introduced at the interface between LLMs and T2I applications has not been examined to date. This study analyzes 770 images of librarians and curators generated by DALL-E from ChatGPT-4o prompts to investigate the source of gender, ethnicity, and age biases in these visualizations. Comparing the prompts generated by ChatGPT-4o with DALL-E’s visual interpretations, the study demonstrates that DALL-E primarily introduces biases where ChatGPT-4o provides non-specific prompts. This highlights the potential for generative AI to perpetuate and amplify harmful stereotypes related to gender, age, and ethnicity in professional roles.

1. Introduction

Ever since their public release in late 2022, generative artificial intelligence (AI) models, such as large language models (e.g., ChatGPT) and text-to-image models (e.g., DALL-E), have seen a rapid growth in popularity, leading to their increasing incorporation into the workflows of offices, schools and private life. Fundamental to this success was the ability to communicate with generative AI models through plain-language prompts and to receive plain-language responses in return, as well as to use plain-language prompts to generate images. A key benefit of text-to-image models is that the images can be tweaked with structured prompts, often interactively by refining the prompts, to develop complex images. A major concern, so far understated in the literature, is the fact that these text-to-image creations are copyright free and can be used by their creator in any medium and in any quantity desired. This makes them a cheap alternative to stock photography, where reproduction fees are commonly charged based on the nature and quantity of use. Consequently, it can be posited that the use of such generative AI-created imagery in the media and advertising will only increase. A problem underlying the rapid creation and dissemination of text-to-image generated visualizations is that, if created thoughtlessly, they will include visual biases derived from gender, age, ethnicity, occupational and other stereotypes which, through their reproduction, will be widely disseminated and amplified.
Gender, age and ethnic (‘racial’) stereotypes are pervasive social constructs based on generalizations and othering, which act as a shorthand in social communication [1,2]. Such stereotypes are harmful when associated with professions and occupations, as they negatively affect people’s self-esteem and career prospects and may result in various forms of prejudice and discrimination, both subtle and explicit [3,4,5]. People who do not conform to these stereotypical expectations therefore often self-censor (‘stereotype threat’) and avoid pursuing such careers, thereby not only denying their own aspirations (with possible associated mental health impacts) but also further reinforcing the existing stereotypes through underrepresentation. While image creation through art or photography involves a conscious decision about which elements to include in the portrayal of a person and the image background, text-to-image generated depictions may include elements that were not included in the prompts but are derived from underlying biases or associations during the image generation process. As these biases and nuances may be subtle, the user generating the image may not be aware of the secondary messaging the image may convey. And even if the user is aware, the text-to-image generator may well be non-responsive to modification prompts [6].
Generative AI models cannot be completely free from bias due to a range of factors such as model architecture, the quality of training data, and developer influence during both design and training. Bias may arise from ideological perspectives, language selection during training, or reliance on sources that may contain outdated or racially biased information [7,8,9]. There is substantial evidence of gender stereotyping in generative AI responses [10,11], particularly in the reinforcement of normative identities and narratives, such as stereotypical binary gender roles [12,13,14,15]. Several studies have explored gender representation across various professions [10,11,16]. A cross-professional study found that 60% of the professions depicted aligned entirely with common gender stereotypes (e.g., mechanics being male and nurses being female) [17].
A substantial body of research has examined gender and ethnic bias in generative AI text-to-image applications such as DALL-E 2, DALL-E 3, Midjourney, and Stable Diffusion. Comparisons between AI-generated text prompts and the resulting images suggest that biases are either introduced or amplified by the image generation algorithms [18]. In text-to-image AI-generated images, covert bias through inadequate representation is widespread [19]. For instance, when gender is unspecified, AI image generators depict men in “important” roles more frequently than women, particularly in the medical field [20,21,22,23,24,25,26,27,28,29,30,31]. Similarly, unless explicitly directed otherwise, ChatGPT-generated prompts tend to produce images primarily featuring individuals of Caucasian appearance [20,21,22,23,24,25,26,27,28,29,30,32,33]. Other ethnic biases manifest in portrayals of African-American individuals in service roles [33], in contrast to Caucasian-looking characters, reinforcing patterns observed in textual analyses [34,35]. Beyond gender and ethnicity, AI image generators also depict male and female figures with stereotypical facial and body features [24,30,36], often favoring younger individuals [21,24,28]. In single-shot prompting, representations of pregnant women or individuals with visible disabilities or impairments are notably absent [24].
This paper draws on an existing data set of 770 images that had been generated by ChatGPT-4o for an assessment of the representation of librarians [20] and curators [21]. The study investigates the biases present at the interface between large language models (LLMs) and text-to-image (T2I) applications. Specifically, it seeks to determine whether observed gender, ethnicity, and age biases in visualizations are introduced by ChatGPT-4o’s autogenerated prompts or by DALL-E’s visual interpretations of these prompts.
Despite the popular stereotype of the librarian as a frumpy, middle-aged and unattractive spinster [20] and that of the nerdy, bearded curator, conservative in dress and stuck in time rummaging through collections [21], the professions of librarian and curator are in fact gender neutral and are thus eminently suitable for examining underlying biases in generative AI. The scope of this paper is limited to the examination of these specific types of biases in the context of images generated by the ChatGPT-4o/DALL-E combination.

2. Methodology

This paper employs a mixed quantitative and qualitative analysis to examine 640 images of librarians and curators, as well as 200 images of women working in cultural industries, generated from ChatGPT-4o prompts, analyzing the distribution of gender, ethnicity, and age both in the prompts and in the images rendered by DALL-E.

2.1. The Data

Two data sets drawn on for this study had been generated for independent studies that examined the visual representation of librarians [20] and curators [21]. An additional data set contained representations of 200 women working in the cultural industries [37]. The studies for which these datasets were generated did not examine the causes of any observed biases.
The prompts given to ChatGPT-4o in the original studies were intentionally designed in an unconstrained fashion to avoid the injection of user bias and to prevent responses that are a priori biased towards the user’s perceptions. The following prompts were used in the two datasets:
“Think about [type of library] and the librarians working in these. Provide me with a visualization that shows a typical librarian against the background of the interior of the library.”
“Think about [type of museum] and the curators working in these. Provide me with a visualization that shows a typical curator against the background of the interior of the museum.”
ChatGPT-4o autonomously generated the detailed textual prompts for DALL-E to create the images, ensuring that no human intervention influenced the conceptualization or execution of the images. Each generated image was saved to disk, and the AI-generated prompt was retrieved from the image panel (Figure 1) and stored alongside the image in a data file. After saving, the chat session was deleted to ensure a clean and unbiased generation process for new images, preventing any legacy information from influencing subsequent outputs.
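The image and prompt pairs in the original studies were collected interactively through the ChatGPT-4o web interface. As a minimal illustration of how a comparable prompt-image pair could be captured programmatically, the following Python sketch uses the OpenAI Images API (an assumption; it was not the workflow of the original studies), in which the API's automatic prompt rewriting for DALL-E 3 stands in for, but is not identical to, ChatGPT-4o's interpretation step:

import json
import urllib.request
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

user_prompt = (
    "Think about public libraries and the librarians working in these. "
    "Provide me with a visualization that shows a typical librarian "
    "against the background of the interior of the library."
)

# For DALL-E 3 the Images API rewrites the prompt and returns the rewritten text,
# loosely analogous to the prompt visible in the ChatGPT image panel (Figure 1).
result = client.images.generate(model="dall-e-3", prompt=user_prompt, n=1, size="1024x1024")

record = {
    "user_prompt": user_prompt,
    "revised_prompt": result.data[0].revised_prompt,
    "image_url": result.data[0].url,
}

urllib.request.urlretrieve(record["image_url"], "librarian_001.png")  # save the image to disk
with open("librarian_001.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2)  # store the prompt pair alongside the image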

2.2. Data Transparency

According to an established protocol [38], all images as well as the prompts that ChatGPT-4o/DALL-E used to generate the images discussed in this paper have been archived as formal data sets at the author’s institution and are readily accessible [37,39,40]. Given the size of the data sets, these are not reproduced in this paper and the reader is directed to the three sources.

2.3. Scoring

Visual cues to gender (male/female), ethnicity and age were scored. The scoring of age was based on facial representation alone, drawing on personal long-term assessment of imagery. Gender was classified as male/female; this binary classification excludes representations of non-binary individuals and was adopted because none of the prompts specifically referred to non-binary persons.
Given that all images represented professionals at work, it was appropriate to divide the working age range into three equal-sized bins (20–34, 35–49, 50–65), labelled ‘young’, ‘middle aged’ and ‘old’. The bin ‘middle aged’ (mid 30s to late 40s) functioned as the key classifier, defined by fine lines and wrinkles around the eyes, mouth (‘smile lines’/‘nasolabial folds’) and forehead, cheeks looking less full, the midface appearing flatter (‘volume loss’), and skin looking less firm around the jawline, the neck and under the eyes. Depictions with facial features lacking these middle-aged characteristics were classed as young, while depictions with white hair and/or pronounced wrinkles were classed as old. A second opinion was sought where a depiction could not be unequivocally allocated to one of the three classifications. In addition, the presence of a beard for men, hair style (bob, open long hair, bun, ponytail) for women, as well as the presence of glasses/spectacles or a book were recorded.
The image prompts were coded based on explicit specifications in the text (e.g., ‘middle-aged’, ‘woman’, etc.).
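As an illustration only, such keyword-based coding of the prompt text could be scripted along the following lines; the category labels mirror those used in Tables 1 and 2, but the cue lists and the code_prompt helper are hypothetical simplifications of what was, in this study, a manual coding step:

import re

# Hypothetical cue lists; categories loosely mirror Tables 1 and 2. Explicit gender terms
# are checked before contextual cues such as 'blazer', so that e.g. "a woman in a blazer"
# is coded as female rather than 'possibly male'.
GENDER_CUES = {
    "female": [r"\bwoman\b", r"\bfemale\b", r"\bblouse\b", r"\bskirt\b"],
    "male": [r"\bman\b", r"\bmale\b", r"\bbeard\b", r"\btailored suit\b"],
    "dual gender": [r"\bcardigan\b"],
    "possibly male": [r"\bblazer\b"],
}
AGE_CUES = {
    "young": [r"\byoung\b", r"\blate 20s\b", r"\bearly 30s\b"],
    "middle aged": [r"\bmiddle-aged\b", r"\bmid-40s\b"],
    "old": [r"\bolder\b", r"\belderly\b"],
}

def code_prompt(prompt: str) -> dict:
    """Return the first matching gender and age category, or 'not specified'."""
    text = prompt.lower()
    def first_match(cues: dict) -> str:
        for label, patterns in cues.items():
            if any(re.search(p, text) for p in patterns):
                return label
        return "not specified"
    return {"gender": first_match(GENDER_CUES), "age": first_match(AGE_CUES)}

print(code_prompt("A middle-aged librarian in a tailored blazer, standing amid tall shelves."))
# -> {'gender': 'possibly male', 'age': 'middle aged'}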

2.4. Statistics

Summary data and frequencies were established using MS Excel, while the statistical comparisons were carried out with MedCalc’s comparison of proportions calculator (MedCalc Software, 2018, https://www.medcalc.org/).
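For transparency, the two-proportion chi-squared (z-squared) test that underlies such comparison-of-proportions calculators can be reproduced in a few lines of Python. The sketch below shows the generic pooled test, not MedCalc's exact implementation (which may, for example, apply an ‘N-1’ adjustment), and it reproduces the values reported in Section 3 only approximately:

import math

def compare_proportions(x1: int, n1: int, x2: int, n2: int) -> tuple:
    """Pooled two-proportion test of H0: p1 == p2; returns (chi_squared, two_sided_p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    chi2 = z * z                                  # with df = 1, chi-squared equals z squared
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided tail of the standard normal
    return chi2, p_value

# Counts from Table 2: among the 154 prompts that specified an age,
# 96 specified 'middle aged' and 33 specified 'young'.
chi2, p = compare_proportions(96, 154, 33, 154)
print(f"chi2 = {chi2:.3f}, p = {p:.6f}")          # ~52.9, p < 0.0001 (52.757 reported in the text)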

2.5. Limitations

This study focuses on an examination of gender, age and ethnicity bias in the prompts generated by ChatGPT-4o and in the resulting visualizations by DALL-E. As both systems are continually upgraded, responses may vary over time. To account for this, the full responses have been documented in the datasets [37,39,40].

3. Results

Unless a user issues a very specific image generation prompt and restricts ChatGPT-4o to that prompt, ChatGPT-4o will interpret the user request and, based on this, will autonomously generate the prompt that is issued to DALL-E for rendering. As the resulting image comes with the text of the prompt used (accessible via the image panel, see Figure 1), it is possible to examine the image generation sequence:
user prompt → ChatGPT-4o interpretation → DALL-E interpretation
The prompts autonomously generated by ChatGPT-4o can be specific, giving age or gender (Figure 2A); provide an inferred reference to age or gender (Figure 2B); or be silent on both age and gender (Figure 2C). DALL-E will then interpret and render these.

3.1. Gender

While the setting and paraphernalia describe a profession, even though some may be stereotypical (‘scientists wear lab coats’), the apparent gender of the person depicted is an important and obvious signifier that influences how viewers respond to an image. Unless underlying conditions circumscribe the outcome, the representation of men and women should be equal in a larger dataset. In 77.8% of cases where the prompt autonomously generated by ChatGPT-4o specified a gender, that apparent gender was female.
Contextual identification of gender in the prompts could be as obvious as the inclusion of ‘beard’, which is uniquely male. In seven instances the choice of ‘tailored suit’ allowed for a positive identification, while in one instance the prompt text ‘tailored suit or a chic dress’ allowed DALL-E to generate either option (a male was rendered) (Table 1).
Table 1. Gender representation in prompts and rendered images (columns: apparent gender as rendered).

Prompt specification | Female | Male | Total
gender specified: female | 37 | 3 | 40
gender specified: male or female | 24 | 28 | 52
gender specified: male | 2 | 37 | 39
gender inferred via context: male (tailored suit) | – | 7 | 7
gender inferred via context: possibly male (blazer) | 5 | 99 | 104
gender inferred via context: dual gender (cardigan etc.) | 15 | 11 | 26
no gender prescription | 104 | 308 | 412
Total | 187 | 493 | 680
Figure 2. Examples of prompts autonomously generated by ChatGPT-4o and the resulting visualization by DALL-E. (A) prompt with specific age reference; (B) prompt with inferred gender reference (‘tailored suit’); (C) prompt that is silent on both age and gender.
Almost a third of the prompts included the term ‘blazer’, 95.2% of which were rendered as male (Table 1). The blazer was traditionally an exclusively male piece of attire and predominantly remains so, although it has also become part of the female professional wardrobe (‘power suit’) [41,42]. To clarify this, we can draw on a DALL-E-generated image dataset of 200 women working in cultural and creative industries [37], which includes 80 librarians and curators. The term ‘blazer’ was included in 28.5% of those prompts (‘cardigan’ in 17.0%), although in 13% the inclusion of additional gender-specific terms (‘blouse’, ‘skirt’) modulated ‘blazer’ and would have defined the output.
If ChatGPT attributes a blazer to both males and females, then the rendering of 95.2% of the blazer prompts as male suggests that DALL-E has an underlying bias towards directly or indirectly classifying cultural industry professionals such as librarians and curators as male. This is further underlined by the apparent gender representation in those instances where the prompt specification remained completely neutral: here, 84.9% of visualizations rendered the professional as male (Table 1).
To assess whether the ambiguous prompt (‘tailored suit or a chic dress with subtle accessories’) would cause DALL-E to generate a balanced representation, ten iterations of the exact prompt text were run, resulting in eight images of a male curator and one image of a female curator, with one image rendering a pair of curators of either gender. When the key sequence was reversed (‘chic dress with subtle accessories or a tailored suit’), DALL-E generated seven images of a male curator, one image of a female curator, one image of a pair of curators of either gender, and one image of a pair of male curators.
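These iterations were run interactively and the rendered genders were coded manually. A scripted variant of such a repeated-rendering check might look as follows (an assumed API-based sketch; the prompt wording is illustrative rather than the verbatim dataset prompt):

import urllib.request
from openai import OpenAI

client = OpenAI()
ambiguous_prompt = (
    "A curator in a tailored suit or a chic dress with subtle accessories, "
    "standing in the interior of a museum gallery."  # illustrative wording, not the dataset text
)

# Render the identical prompt text ten times and save the images for manual gender coding.
for i in range(10):
    out = client.images.generate(model="dall-e-3", prompt=ambiguous_prompt, n=1)
    urllib.request.urlretrieve(out.data[0].url, f"ambiguous_prompt_{i:02d}.png")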
The question arises as to where the biases exhibited by ChatGPT and by DALL-E come from. To be relevant for either of the models, any comparative statistical data on the gender profiles of librarians and curators (who form the basis of the analyzed data set) must predate the September 2021 cut-off for the ChatGPT and DALL-E training data. Considering English-speaking countries, in the USA 82.5% of librarians and 60.2% of curators identified as women (2017) [43], while in Australia the figures were 83.8% and 67.6%, respectively (2016) [44]. In the United Kingdom, 76.1% of librarians report as female [45], compared with 80% in Canada [46]. Taking the average gender balance of librarians and curators in these English-speaking countries as a guide, approximately 70% of the workforce are women.

3.2. Age

Of the 680 autonomously ChatGPT-4o-generated prompts, only 22.6% specified an age, mainly ‘middle aged’ (Table 2). Among these, middle-aged staff dominate (62.3%), a proportion that is significantly higher than that of young staff (21.4%; χ2 = 52.757, df = 1, p < 0.0001) and of older staff (16.2%; χ2 = 68.406, df = 1, p < 0.0001). Although ‘middle aged’ was specified, DALL-E rendered 16.7% of the ‘middle aged’ professionals as young and 15.6% as old. That a combined 32.3% were misrendered is also highly significant (χ2 = 23.935, df = 1, p < 0.0001).
For the remaining 77.4% of the image prompts, ChatGPT-4o was silent on age and it was left to DALL-E to interpret and render. Contextually, as the position descriptions for both librarians and curators require a university degree, it can be anticipated that the bulk of these professionals would belong to the middle-aged and old age groups. Among the prompts that provided no age specification, however, the rendered ages are dominated by young staff (59.3%), a proportion that is significantly higher than that of middle-aged staff (34.0%; χ2 = 67.577, df = 1, p < 0.0001), which in turn is significantly higher than that of older staff (6.7%; χ2 = 120.814, df = 1, p < 0.0001). This clearly demonstrates that the prompt generation by ChatGPT overrepresents middle-aged professionals (15.2%:71.7%:13.1%), while the visualizations by DALL-E overrepresent young staff (67.3%:26.6%:7.6%).
There are only a few age profiles for librarians and curators that predate the training data cut-off of September 2021. When concatenated into the classes young (≤34), middle aged (35–54) and old (≥55), census data for the USA provide proportions of 22.0%:36.2%:41.9% for librarians (2017) [43] and 33.4%:28.1%:38.5% for curators (2020) [47]. When comparing the generative AI representations with these USA data for staff working as cultural industry professionals (such as librarians and curators), the overrepresentation of young staff (χ2 = 31.285, df = 1, p < 0.0001) and the underrepresentation of older staff (χ2 = 29.070, df = 1, p < 0.0001) by DALL-E are highly significant, as are the overrepresentation of middle-aged staff (χ2 = 31.174, df = 1, p < 0.0001) and the underrepresentation of older staff (χ2 = 18.691, df = 1, p < 0.0001) by ChatGPT.
Returning to the additional 200 women working in cultural industries, ChatGPT-4o did not provide any age specification in 92.0% of its autogenerated prompts. Of these, DALL-E rendered 88.0% as young, 11.4% as middle-aged and one image (0.5%) as old (Table 3).
When the rendered images of the 200 women working in cultural industries were loaded into ChatGPT-4o to classify their appearance into 10-year age cohorts, the overwhelming majority (88%) were classified as belonging to the 20–30-year age bracket (Table 4). Dissonance between the ChatGPT-4o classification and the visual classification carried out by the author amounted to 10.5%. Each of the dissonant images was independently assessed by a second party, confirming the author’s classification (Table 4).
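The re-classification was carried out through the ChatGPT-4o interface; a scripted equivalent could, for instance, pass each image to the model with a fixed instruction, as in the following sketch (the model name, prompt wording and file name are assumptions):

import base64
from openai import OpenAI

client = OpenAI()

with open("librarian_001.png", "rb") as fh:  # hypothetical file name of a rendered image
    image_b64 = base64.b64encode(fh.read()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Estimate the apparent age of the person shown. Answer only with one "
                     "cohort: 10-20, 20-30, 30-40, 40-50, 50-60, or 60+."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # e.g. "20-30"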

3.3. Ethnicity

In terms of ethnicity, only seven of the 680 images (1.03%) depicted a person other than Caucasian: six with Asian features and one with Hispanic features [39]. All of the 200 women in cultural professions were rendered as Caucasian. As none of the autonomously ChatGPT-4o-generated prompts referred to ethnicity or diversity, this racial bias is solely due to the training data and algorithm used for DALL-E. The bias is extreme and exceeds that observed in other studies.
Unlike the representations of age and gender discussed above, the ethnic bias does reflect the state of the workforce. Based on statistics for the ethnicity of museum curators and librarians, 86.0% of librarians and 86.2% of curators in the USA were Caucasian (2017) [43]. Available figures for librarians in Canada (89%) [48] and the United Kingdom (91.9%) [45] are even higher. There are no data that would allow the observed bias in the ethnic representation of women in cultural professions to be correlated with actual figures.

3.4. Glasses

A small number of the generated images (11.3%) depict the professionals with glasses. Considering gender, where ChatGPT-4o did not specify a gender, DALL-E rendered the wearers of glasses as male very significantly more often than as female (81.8%; χ2 = 26.292, df = 1, p < 0.0001). Similarly, among the five prompts that were written as bi-gender, DALL-E rendered 80% as male (Table 5).
When considering the age distribution among persons shown wearing glasses, there is a dissonance between the specified and the rendered ages, with the rendered ages tending to be greater (Table 6). Among prompts where ChatGPT-4o did not specify an age, persons with glasses were significantly more likely to be rendered as middle-aged than as young (χ2 = 7.091, df = 1, p = 0.0077) or old (χ2 = 5.772, df = 1, p = 0.0163). In contrast, the sample of 200 women working in cultural industries showed an overwhelming bias towards the depiction of young women where glasses were not mentioned in the prompts (89.7%) (Table 7).

4. Discussion

This paper has shown that biases are introduced both at the initial generation of prompts by ChatGPT and again when these prompts are interpreted by DALL-E. These biases result in imagery of individuals who are overwhelmingly Caucasian, predominantly male and young. These biases appear to be inherent in the algorithms used by ChatGPT and DALL-E and come to the fore when these generative AI systems are required to fill in blank demographic or other details.
Based on the observations made in this study, we can propose the following flow of prompt interpretation and the nature and magnitude of the biases generated by ChatGPT and DALL-E (Figure 3). Following the user prompt, which in this study was deliberately unconstrained to avoid the injection of user bias and to prevent responses that are a priori biased towards the user’s perceptions, ChatGPT-4o interprets the request and generates a prompt that is passed to DALL-E. At this point biases can be introduced by the direct mention of gender, age or ethnicity, or by the inclusion of attributes of dress style that act as contextual cues predicating a specific gender, age or ethnicity. Once passed on, that prompt is interpreted by DALL-E, which applies its own ’interpretation’, thereby either rendering the specific mentions correctly (in most but not all cases), interpreting the contextual cues, or applying its own interpretation to an unspecified prompt (essentially filling in the demographic blanks). It is at that point that DALL-E exhibits its own biases, such as an overrepresentation of males, of Caucasians and of young people.
As this paper has shown, ChatGPT-4o will generate image prompts to be passed to DALL-E that nominate or infer a gender. Gender-specific prompts are less common than prompts in which a gender, predominantly male, can be inferred contextually from the specified clothing. As noted earlier, there is abundant evidence of gender stereotyping in generative AI responses, showing that the output is congruent with popular perceptions of profession-cum-gender stereotypes [10,11]. In this context it is important to note that the stereotype of the librarian in major English-speaking countries is that of an older, white woman [49,50,51], while the stereotype of a curator is overwhelmingly white and predominantly female [52]. Yet, as this analysis has shown, ChatGPT construes them contextually as male. While the origin of this dissonance cannot at present be explained with any certainty, it is likely that underlying training biases in ChatGPT classify people in professional positions predominantly as male, irrespective of the actual gender composition.
OpenAI used red teaming [53] to validate and moderate responses generated by ChatGPT, with team members drawn from a wide range of expertise. While no specialists in museology or librarianship were represented, red team members with self-reported domains of expertise in Anthropology, Sociology and Education were broadly relevant to the images analyzed here [54]. The red team membership was biased towards particular educational and professional backgrounds (PhDs or significant higher-education/industry experience) and towards English-speaking, Western countries (U.S., Canada, U.K.), which would have influenced how members interpreted and flagged politics, values, and other representations of the model [55]. OpenAI does not comment on the gender and ethnic balance of its red team. While biases and some hallucinations in ChatGPT output are acknowledged [45], no systemic biases in ChatGPT have been flagged by OpenAI [54,55,56,57].
The image creation process of DALL-E entails that a user-provided prompt is encoded through an auto-regressive transformer encoder, image tokens are sampled sequentially from the decoder’s predicted distribution over the next token, and the resulting sequence of image tokens is decoded through the VQ-VAE decoder [58]. The best of the multiple candidate images is selected via CLIP (Contrastive Language-Image Pre-training) [59] and presented to the user [60]. DALL-E was trained on an image data set comprising 250 million image-caption pairs [60], excluding graphic sexual and violent content as well as images of some hate symbols [61]. During the training phase of DALL-E, OpenAI used red teaming [53] to validate and moderate generated images. The red team membership was similar to that deployed for the evaluation of ChatGPT-4 [61,62].
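To make the sampling loop described above concrete, the following toy sketch mimics only its control flow (encode the prompt, sample image tokens one at a time from a predicted distribution, decode the token sequence into pixels); all components are random stand-ins rather than DALL-E’s actual networks, and the CLIP re-ranking of multiple candidates is omitted:

import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_TOKENS, PATCH = 64, 16, 8           # codebook size, tokens per image, patch edge length

def encode_prompt(prompt: str) -> np.ndarray:
    """Stand-in text encoder: a deterministic pseudo-embedding of the prompt."""
    seed = abs(hash(prompt)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=32)

def next_token_distribution(text_emb: np.ndarray, history: list) -> np.ndarray:
    """Stand-in decoder: a softmax over the codebook, 'conditioned' on text and history."""
    logits = rng.normal(size=VOCAB) + 0.01 * text_emb.sum() + 0.01 * sum(history)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def decode_tokens(tokens: list) -> np.ndarray:
    """Stand-in VQ-VAE decoder: map each token to a grey patch and tile a 4 x 4 grid."""
    codebook = np.linspace(0.0, 1.0, VOCAB)                       # one grey level per code
    patches = [np.full((PATCH, PATCH), codebook[t]) for t in tokens]
    rows = [np.hstack(patches[i:i + 4]) for i in range(0, N_TOKENS, 4)]
    return np.vstack(rows)

text_emb = encode_prompt("a typical librarian in a library interior")
tokens = []
for _ in range(N_TOKENS):                                         # sequential (autoregressive) sampling
    probs = next_token_distribution(text_emb, tokens)
    tokens.append(int(rng.choice(VOCAB, p=probs)))

image = decode_tokens(tokens)
print(image.shape)                                                # (32, 32) toy 'image'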
OpenAI acknowledged in its system cards that biases will be “present in DALL·E 2 training data and the way in which the model is trained… how and for whom the system is designed; which risks are prioritized with associated mitigations; how prompts are filtered and blocked; and how uploads are filtered and blocked. Further, bias stems from the fact that the monitoring tech stack and individuals on the monitoring team have more context on, experience with, and agreement on some areas of harm than others” [61]. OpenAI further acknowledged that “default behavior of the DALL·E 2 Preview produces images that tend to overrepresent people who are White-passing and Western concepts generally…[and] tends to serve completions that suggest stereotypes, including race and gender stereotypes. For example, the prompt “lawyer” results disproportionately in images of people who are White-passing and male-passing in Western dress, while the prompt “nurse” tends to result in images of people who are female-passing” [61]. That bias persisted in DALL·E 3 as acknowledged by OpenAI: “by default, DALL·E 3 produces images that tend to disproportionately represent individuals who appear White, female, and youthful” [62] unless the prompts specifically moderate the output.
The data presented in this paper show that where the ChatGPT-injected prompt specification remained gender neutral, DALL-E rendered the professional as male in 84.9% of visualizations (Table 1). This runs counter to the general observation that DALL-E renders people as female in response to unconstrained prompts [62], but confirms other studies that examined DALL-E visualizations [17,22,23,24,25,26,27,28,63]. On the other hand, the data in this study confirm the known DALL-E biases towards youthfulness and white/Caucasian appearance [62,64], with young staff dominating the depiction of librarians and curators (67.3%), while middle-aged and older staff were significantly underrepresented (Table 2).
One consideration could be the currency of the information included in the training data sets. For training data, the development of ChatGPT drew on readily accessible and copyright-free sources, comprising old books and government publications but also internet text and Wikipedia pages, giving it some contextual ‘currency’ until the cut-off date of September 2021 [8,65]. While the same is more or less true for the development of DALL-E, that dataset may well be biased towards older imagery. A study by Shin et al., for example, indicated that generative diffusion models may have a propensity to draw on historical and cultural stereotypes when generating images from prompts, rather than responding to direct and explicit prompt cues. All tested models, Stable Diffusion, DALL-E and Firefly, suffered from this [66].
It appears that, while societal biases and stereotypes are ported into the algorithms via the training data, these are not caught and corrected during the red teaming phase of system evaluation. As noted elsewhere when assessing the cause of the persistence of Bitcoin imagery in DALL-E-generated images [6], the red team may have normalized existing societal biases in their own daily lives and thus may not have been conscious of them when faced with the imagery. Moreover, given that biases only become recognizable as a pattern when larger data sets are generated, it can be speculated that these biases escaped attention during the red teaming phase because only a select number of images of a given configuration would have been generated. Given the harmful effects the perpetuation of social biases may have on individuals, these observations suggest that larger datasets of the same prompts need to be generated during the red teaming phase to identify and weed out the perpetuation of societal biases in generative AI content. They also suggest that the extensive representation of societal biases currently evident in generative AI content [10,11,12,13,14,15,16,19,24,25,27,28,29,30,31,32,33] could be mitigated with greater diligence during the internal evaluation period and prevented when completely new algorithms are designed and rolled out.

5. Conclusions

Considering a data set of 770 visualizations of librarians and curators derived from zero-shot prompts to ChatGPT, this paper examined the bias introduced at the interface between LLMs and T2I applications, which had not been studied to date. Comparing the prompts generated by ChatGPT-4o with their visual interpretations by DALL-E, it explored biases of gender, ethnicity, and age. It could be demonstrated that most biases in the visualizations were introduced by DALL-E in those cases where ChatGPT-4o generated non-specific image prompts, forcing DALL-E to fill in the demographic blanks.
Users of the ChatGPT-to-DALL-E interface need to be aware of the biases introduced by both elements: by ChatGPT when interpreting the user request and creating the prompt to be passed to DALL-E, and then by DALL-E when interpreting the ChatGPT-generated prompt. The more open-ended the user-issued request, the more both ChatGPT and DALL-E are called upon to fill in the demographic blanks, with the result that their inherent, training-data-derived biases come to the fore. Of potential concern for the novice, casual or uncritical user is the fact that both ChatGPT and DALL-E are subject to several inherent biases and that these perpetuate potentially quite harmful stereotypes. The generated visualizations create the impression that the professions are dominated by males, when in reality they are predominantly female, and that they are comprised of youthful individuals, when in reality the median age of staff is in the upper forties to lower fifties (depending on the country). The only aspect of the DALL-E visualizations that somewhat reflects reality is the dominance of white Caucasians, even though the real proportions are considerably lower than those depicted by DALL-E. While this stereotyping of the ethnic balance may reflect reality, it is nonetheless deplorable, as ethnic (‘racial’) stereotypes in relation to professions and occupations are damaging: they harm a person’s self-esteem and career prospects and may lead to discrimination, from subconscious to overt. In the case of curators and librarians, the pro-Caucasian ethnic bias reinforces the stereotype that such professions are not for non-whites, thus perpetuating the existing demographic bias and undermining the ambitions that members of other ethnic communities may have.
To overcome these algorithmic biases, users need to eschew unconstrained single-shot prompts. Rather than seeking a ‘quick-fix’ solution by asking ChatGPT/DALL-E to “provide me with a visualization that shows a typical XYZ”, they should issue a prompt that specifies gender, age and ethnicity. On the developer side, it would be desirable if the DALL-E output passed back to ChatGPT could routinely present the user with two images side by side, one depicting a female character and the other a male character. In terms of functionality this already exists, as users are frequently asked which of two renderings, offered side by side, they prefer.
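As an illustration of such a constrained prompt, and of pairing an explicitly female with an explicitly male rendering, a scripted version might look as follows (the API usage and prompt template are assumptions; the recommendation itself concerns prompt wording, not a specific toolchain):

from openai import OpenAI

client = OpenAI()
template = ("A {gender}, {age}, {ethnicity} museum curator standing in the "
            "interior of a natural history museum, photorealistic style.")

# Render an explicitly female and an explicitly male variant of the same constrained prompt.
for gender in ("female", "male"):
    prompt = template.format(gender=gender, age="middle-aged", ethnicity="South Asian")
    out = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    print(gender, out.data[0].url)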

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available via doi 10.26189/951bb16a-3dda-4a84-bd3d-994203d28c7e, doi 10.26189/e77b1068-b059-492c-94e9-7cf0c9333622 and doi 10.26189/5320dc25-a98f-4a80-b241-86f5fdca999e.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Snyder, M. On the self-perpetuating nature of social stereotypes. In Cognitive Processes in Stereotyping and Intergroup Behavior; Psychology Press: London, UK, 2015; pp. 183–212. [Google Scholar]
  2. Shih, M.; Bonam, C.; Sanchez, D.; Peck, C. The social construction of race: Biracial identity and vulnerability to stereotypes. Cult. Divers. Ethn. Minor. Psychol. 2007, 13, 125. [Google Scholar] [CrossRef] [PubMed]
  3. Greenwald, A.G.; Banaji, M.R.; Rudman, L.A.; Farnham, S.D.; Nosek, B.A.; Mellott, D.S. A unified theory of implicit attitudes, stereotypes, self-esteem, and self-concept. Psychol. Rev. 2002, 109, 3. [Google Scholar] [CrossRef] [PubMed]
  4. Heilman, M.E. Gender stereotypes and workplace bias. Res. Organ. Behav. 2012, 32, 113–135. [Google Scholar]
  5. Ellemers, N. Gender stereotypes. Annu. Rev. Psychol. 2018, 69, 275–298. [Google Scholar] [CrossRef] [PubMed]
  6. Spennemann, D.H.R. Non-responsiveness of DALL-E to exclusion prompts suggests underlying bias towards Bitcoin. SSRN Prepr. 2025. [Google Scholar]
  7. Choudhary, T. Political Bias in AI-Language Models: A Comparative Analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude. IEEE Access 2025, 13, 11341–11365. [Google Scholar] [CrossRef]
  8. Spennemann, D.H.R. The origins and veracity of references ‘cited’ by generative artificial intelligence applications. Publications 2025, 13, 12. [Google Scholar] [CrossRef]
  9. Tao, Y.; Viberg, O.; Baker, R.S.; Kizilcec, R.F. Cultural bias and cultural alignment of large language models. PNAS Nexus 2024, 3, 346. [Google Scholar] [CrossRef]
  10. Kaplan, D.M.; Palitsky, R.; Arconada Alvarez, S.J.; Pozzo, N.S.; Greenleaf, M.N.; Atkinson, C.A.; Lam, W.A. What’s in a name? Experimental evidence of gender bias in recommendation letters generated by ChatGPT. J. Med. Internet Res. 2024, 26, e51837. [Google Scholar] [CrossRef]
  11. Duan, W.; McNeese, N.; Li, L. Gender Stereotypes toward Non-gendered Generative AI: The Role of Gendered Expertise and Gendered Linguistic Cues. Proc. ACM Hum.-Comput. Interact. 2025, 9, 1–35. [Google Scholar]
  12. Gillespie, T. Generative AI and the politics of visibility. Big Data Soc. 2024, 11, 20539517241252131. [Google Scholar] [CrossRef]
  13. Gross, N. What ChatGPT tells us about gender: A cautionary tale about performativity and gender biases in AI. Soc. Sci. 2023, 12, 435. [Google Scholar] [CrossRef]
  14. Desai, P.; Wang, H.; Davis, L.; Ullmann, T.M.; DiBrito, S.R. Bias Perpetuates Bias: ChatGPT Learns Gender Inequities in Academic Surgery Promotions. J. Surg. Educ. 2024, 81, 1553–1557. [Google Scholar] [CrossRef] [PubMed]
  15. Farlow, J.L.; Abouyared, M.; Rettig, E.M.; Kejner, A.; Patel, R.; Edwards, H.A. Gender Bias in Artificial Intelligence-Written Letters of Reference. Otolaryngol.-Head Neck Surg. 2024, 171, 1027–1032. [Google Scholar] [CrossRef]
  16. Urchs, S.; Thurner, V.; Aßenmacher, M.; Heumann, C.; Thiemichen, S. How Prevalent is Gender Bias in ChatGPT?—Exploring German and English ChatGPT Responses. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Turin, Italy, 18–22 September 2023; pp. 293–309. [Google Scholar]
  17. Melero Lázaro, M.; García Ull, F.J. Gender stereotypes in AI-generated images. El Prof. De La Inf. 2023, 32, e320505. [Google Scholar] [CrossRef]
  18. Saumure, R.; De Freitas, J.; Puntoni, S. Humor as a window into generative AI bias. Sci. Rep. 2025, 15, 1326. [Google Scholar] [CrossRef]
  19. Hacker, P.; Mittelstadt, B.; Borgesius, F.Z.; Wachter, S. Generative discrimination: What happens when generative AI exhibits bias, and what can be done about it. arXiv 2024, arXiv:2407.10329. [Google Scholar]
  20. Spennemann, D.H.R.; Oddone, K. What Do Librarians Look Like? Stereotyping of a Profession by Generative Ai. J. Librariansh. Inf. Sci. Subm. 2025; (under review). [Google Scholar]
  21. Spennemann, D.H.R. Draw Me a Curator: Stereotyping of a Profession by Generative Ai. Curator Mus. J. Subm. (under review).
  22. Abdulwadood, I.; Mehta, M.; Carrion, K.; Miao, X.; Rai, P.; Kumar, S.; Lazar, S.; Patel, H.; Gangopadhyay, N.; Chen, W. AI Text-to-Image Generators and the Lack of Diversity in Hand Surgeon Demographic Representation. Plast. Reconstr. Surg.–Glob. Open 2024, 12, 4–5. [Google Scholar] [CrossRef]
  23. Currie, G.; Chandra, C.; Kiat, H. Gender Bias in Text-to-Image Generative Artificial Intelligence When Representing Cardiologists. Information 2024, 15, 594. [Google Scholar] [CrossRef]
  24. Currie, G.; John, G.; Hewis, J. Gender and ethnicity bias in generative artificial intelligence text-to-image depiction of pharmacists. Int. J. Pharm. Pract. 2024, 32, 524–531. [Google Scholar] [CrossRef]
  25. Morcos, M.; Duggan, J.; Young, J.; Lipa, S.A. Artificial Intelligence Portrayals in Orthopaedic Surgery: An Analysis of Gender and Racial Diversity Using Text-to-Image Generators. J. Bone Jt. Surg. 2024, 106, 2278–2285. [Google Scholar] [CrossRef] [PubMed]
  26. Ali, R.; Tang, O.Y.; Connolly, I.D.; Abdulrazeq, H.F.; Mirza, F.N.; Lim, R.K.; Johnston, B.R.; Groff, M.W.; Williamson, T.; Svokos, K. Demographic representation in 3 leading artificial intelligence text-to-image generators. JAMA Surg. 2024, 159, 87–95. [Google Scholar] [CrossRef] [PubMed]
  27. Lee, S.W.; Morcos, M.; Lee, D.W.; Young, J. Demographic representation of generative artificial intelligence images of physicians. JAMA Netw. Open 2024, 7, e2425993. [Google Scholar] [CrossRef] [PubMed]
  28. Gisselbaek, M.; Suppan, M.; Minsart, L.; Köselerli, E.; Nainan Myatra, S.; Matot, I.; Barreto Chang, O.L.; Saxena, S.; Berger-Estilita, J. Representation of intensivists’ race/ethnicity, sex, and age by artificial intelligence: A cross-sectional study of two text-to-image models. Crit. Care 2024, 28, 363. [Google Scholar] [CrossRef]
  29. Zhou, M.; Abhishek, V.; Derdenger, T.; Kim, J.; Srinivasan, K. Bias in generative AI. arXiv 2024, arXiv:2403.02726. [Google Scholar]
  30. York, E.J.; Brumberger, E.; Harris, L.V.A. Prompting Bias: Assessing representation and accuracy in AI-generated images. In Proceedings of the 42nd ACM International Conference on Design of Communication, Fairfax, VA, USA, 20–22 October 2024; pp. 106–115. [Google Scholar]
  31. Sandoval-Martin, T.; Martínez-Sanzo, E. Perpetuation of Gender Bias in Visual Representation of Professions in the Generative AI Tools DALL·E and Bing Image Creator. Soc. Sci. 2024, 13, 250. [Google Scholar] [CrossRef]
  32. Wiegand, T.; Jung, L.; Schuhmacher, L.; Gudera, J.; Moehrle, P.; Rischewski, J.; Velezmoro, L.; Kruk, L.; Dimitriadis, K.; Koerte, I. Demographic Inaccuracies and Biases in the Depiction of Patients by Artificial Intelligence Text-to-Image Generators. Preprints 2024. [Google Scholar] [CrossRef]
  33. Hosseini, D.D. Generative AI: A problematic illustration of the intersections of racialized gender, race, ethnicity. OSF Prepr. 2024. [Google Scholar] [CrossRef]
  34. Amin, K.S.; Forman, H.P.; Davis, M.A. Even with ChatGPT, race matters. Clin. Imaging 2024, 109, 110113. [Google Scholar] [CrossRef]
  35. Hofmann, V.; Kalluri, P.R.; Jurafsky, D.; King, S. Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. arXiv 2024, arXiv:2403.00742. [Google Scholar] [CrossRef]
  36. Lio, P.; Ahuja, K. Beautiful Bias from ChatGPT. J. Clin. Aesthetic Dermatol. 2024, 17, 10. [Google Scholar]
  37. Spennemann, D.H.R. Two Hundred Women Working in Cultural and Creative Industries: A Structured Data Set of Generative Ai-Created Images; School of Agricultural, Environmental and Veterinary Sciences, Charles Sturt University: Albury, NSW, Australia, 2025. [Google Scholar] [CrossRef]
  38. Spennemann, D.H.R. Children of AI: A protocol for managing the born-digital ephemera spawned by Generative AI Language Models. Publications 2023, 11, 45. [Google Scholar] [CrossRef]
  39. Spennemann, D.H.R.; Oddone, K. What Do Librarians Look Like? Stereotyping of a Profession by Generative Ai—Supplementary Data; School of Agricultural, Environmental and Veterinary Sciences, Charles Sturt University: Albury NSW, Australia, 2025. [Google Scholar] [CrossRef]
  40. Spennemann, D.H.R. What Do Curators Look Like? Stereotyping of a Profession by Generative Ai—Supplementary Data; School of Agricultural, Environmental and Veterinary Sciences, Charles Sturt University: Albury, NSW, Australia, 2025. [Google Scholar] [CrossRef]
  41. Moore, M.M.; Williams, G.I. No jacket required: Academic women and the problem of the blazer. Fash. Style Pop. Cult. 2014, 1, 359–376. [Google Scholar] [CrossRef] [PubMed]
  42. Kwantes, C.T.; Lin, I.Y.; Gidak, N.; Schmidt, K. The effect of attire on expected occupational outcomes for male employees. Psychol. Men Masculinity 2011, 12, 166. [Google Scholar] [CrossRef]
  43. Data USA Librarians. Available online: https://datausa.io/profile/soc/librarians (accessed on 12 March 2025).
  44. Western Australia. 6273.0 Employment in Culture, 2016; Cultural and Creative Statistics Working Group, Office for the Arts: Perth, WA, Australia, 2021. [Google Scholar]
  45. Reddington, M.; Kinetiq. A Study of the UK’s Information Workforce 2023; Kinetiq: New York, NY, USA, 2024. [Google Scholar]
  46. Statistics Canada. Table 98-10-0449-01 Occupation Unit Group by Labour Force Status, Highest Level of Education, Age and Gender: Canada, Provinces and Territories, Census Metropolitan Areas and Census Agglomerations with Parts. Available online: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810044901 (accessed on 11 March 2025).
  47. Data USA Archivists, Curators, & Museum Technicians. Available online: https://datausa.io/profile/soc/archivists-curators-museum-technicians (accessed on 12 March 2025).
  48. Chan, J. Beyond tokenism: The importance of staff diversity in libraries. Br. Columbia Libr. Assoc. Perspect. 2020, 12. [Google Scholar]
  49. Luthmann, A. Librarians, professionalism and image: Stereotype and reality. Libr. Rev. 2007, 56, 773–780. [Google Scholar] [CrossRef]
  50. White, A. Not Your Ordinary Librarian: Debunking the Popular Perceptions of Librarians; Chandos Publishing: Oxford, UK, 2012. [Google Scholar]
  51. Robinson, L.T. Curmudgeons and dragons? A content analysis of the Australian print media’s portrayal of the information profession 2000 to 2004. Libr. Inf. Sci. Res. E-J. 2006, 16, 1–19. [Google Scholar] [CrossRef]
  52. Lawther, K. Who Are Museum Curators According To Pop Culture? AcidFree 2020, 2025. Available online: http://acidfreeblog.com/curation/who-are-museum-curators-according-to-pop-culture/ (accessed on 12 March 2025).
  53. Brundage, M.; Avin, S.; Wang, J.; Belfield, H.; Krueger, G.; Hadfield, G.; Khlaaf, H.; Yang, J.; Toner, H.; Fong, R.; et al. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. arXiv 2020, arXiv:2004.07213. [Google Scholar] [CrossRef]
  54. OpenAI. GPT-4o System Card; OpenAI: San Francisco, CA, USA, 2024. [Google Scholar]
  55. OpenAI. GPT-4 System Card; OpenAI: San Francisco, CA, USA, 2024. [Google Scholar]
  56. OpenAI. GPT-4V(ision) System Card; OpenAI: San Francisco, CA, USA, 2024. [Google Scholar]
  57. OpenAI. GPT-4.5 System Card; OpenAI: San Francisco, CA, USA, 2025. [Google Scholar]
  58. Van Den Oord, A.; Vinyals, O. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Neural Information Processing Systems Foundation, Inc.: South Lake Tahoe, NV, USA, 2017; Volume 30, pp. 1–10. [Google Scholar]
  59. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  60. Dayma, B.; Patil, S.; Cuenca, P.; Saifullah, K.; Abraham, T.; Lê Khac, P.; Melas, L.; Ghosh, R. DALL-E Mini Explained. Available online: https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA (accessed on 12 March 2025).
  61. OpenAI. DALL-E2 System Card. Available online: https://github.com/openai/dalle-2-preview/blob/main/system-card.md (accessed on 12 March 2025).
  62. OpenAI. DALL-E3 System Card; OpenAI: San Francisco, CA, USA, 2023. [Google Scholar]
  63. Currie, G.; Hewis, J.; Hawk, E.; Rohren, E. Gender and ethnicity bias of text-to-image generative artificial intelligence in medical imaging, part 2: Analysis of DALL-E 3. J. Nucl. Med. Technol. 2024, 52, 356–359. [Google Scholar] [CrossRef]
  64. Liu, S.; Maturi, T.; Yi, B.; Shen, S.; Mihalcea, R. The Generation Gap: Exploring Age Bias in the Value Systems of Large Language Models. arXiv 2024, arXiv:2404.08760. [Google Scholar]
  65. Choudhary, T. Reducing racial and ethnic bias in AI models: A comparative analysis of ChatGPT and Google Bard. In Proceedings of the 36th International RAIS Conference on Social Sciences and Humanities, Princeton, NJ, USA, 6–7 June 2024; pp. 115–124. [Google Scholar]
  66. Shin, P.W.; Ahn, J.J.; Yin, W.; Sampson, J.; Narayanan, V. Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models. arXiv 2024, arXiv:2406.05602. [Google Scholar] [CrossRef]
Figure 1. Screenshot showing a ChatGPT-4o image panel and DALL-E ‘reading’ of the prompt (pop-up window) together with the text of the prompt as issued (right panel).
Figure 3. The reality of the user → ChatGPT → DALL-E image generation process based on the observations of the data in this paper. Illustrated is the flow of the gender bias.
Table 2. Age representation in prompts and rendered images (columns: apparent age as rendered).

Prompt specification | Young | Middle Aged | Old | Total
age specified: young | 19 | 10 | 4 | 33
age specified: middle aged | 16 | 65 | 15 | 96
age specified: old | 1 | 1 | 23 | 25
no age prescription | 312 | 179 | 35 | 526
Total | 348 | 255 | 77 | 680
Table 3. Age representation in prompts and rendered images in a sample of 200 women working in cultural industries [37] (columns: age as rendered).

Prompt specification (ChatGPT terms) | Young | Middle-Aged | Older | Total
age specified: young | 1 | – | – | 1
age specified: late 20s or early 30s | 1 | – | – | 1
age specified: mid-30s | 4 | 2 | – | 6
age specified: late 30s or early 40s | – | 1 | – | 1
age specified: mid-40s | – | 2 | – | 2
age specified: middle-aged | 1 | 4 | – | 5
no age prescription | 162 | 21 | 1 | 184
Total | 169 | 30 | 1 | 200
Table 4. ChatGPT-4o classification of the ages of 200 women working in cultural industries as shown in images generated by DALL-E without age specification in the prompt (columns: age as rendered).

Age cohort as identified by ChatGPT-4o | Young | Middle-Aged | Older | Total
10–20 | 2 | – | – | 2
20–30 | 161 | 15 | – | 176
30–40 | 6 | 13 | – | 19
40–50 | – | 2 | – | 2
50–60 | – | – | 1 | 1
Total | 169 | 30 | 1 | 200
Table 5. Presence of glasses by gender in prompts vs. rendered images (row percentages of gender as rendered; Total = number of images).

Prompt specification | Female (%) | Male (%) | Total (n)
gender specified: male | – | 100.0 | 16
gender specified: bi-gender | 20.0 | 80.0 | 5
gender specified: female | 91.3 | 8.7 | 23
no gender prescription | 18.2 | 81.8 | 33
Total | 36.4 | 63.6 | 77
Table 6. Presence of glasses by age in prompts vs. rendered images (row percentages of age as rendered; Total = number of images).

Prompt specification | Young (%) | Middle Aged (%) | Old (%) | Total (n)
age specified: young | 40.0 | 30.0 | 30.0 | 10
age specified: middle aged | 9.1 | 45.5 | 45.5 | 11
age specified: old | – | 5.0 | 95.0 | 20
no age prescription | 22.2 | 52.8 | 25.0 | 36
Total | 16.9 | 36.4 | 46.8 | 77
Table 7. Presence of glasses by age of persons in prompt specifications vs. rendered images (in percent of age as rendered; Total = number of images) in the sample of 200 women working in cultural industries.

Prompt specification | Young (%) | Middle Aged (%) | Old (%) | Total (n)
age specified: young | 62.5 | 12.5 | – | 8
age specified: middle aged | 12.5 | 87.5 | – | 8
age specified: old | – | – | – | 0
no age prescription | 89.7 | 9.0 | 1.3 | 78
Total | 82.6 | 16.3 | 1.1 | 92
