1. Introduction
The concept of artificial intelligence (AI) was first introduced at the Dartmouth Conference in 1956 [
1]. Early medical applications of AI were limited to simple research and conceptual work. The first medical expert systems, such as MYCIN, CASNET, and INTERNIST-I, emerged in the 1970s and early 1980s. MYCIN, one of the first rule-based systems, was developed to assist physicians in diagnosing bacterial infections and recommending antibiotic therapy [
2]. Systems such as INTERNIST-I [
3] and CASNET [
4] took this a step further by focusing on more complex diagnostic questions, thus contributing to the development of knowledge representation and clinical reasoning tools for the medical domain. Between 2010 and 2020, advances in deep learning transformed the way AI was used in healthcare applications. Regulatory agencies began to limit, regulate, and approve the clinical use of AI algorithms as tools that could improve diagnostics, outcomes, and efficiency in healthcare [
5].
In recent years, generative models such as GPT-3 and GPT-4 (generative pre-trained transformers), together with other large language models (LLMs), have begun to rapidly change how internet users, especially younger ones, apply and routinely use generative AI. In a medical context, generative AI is increasingly being tested and implemented in clinical routines, from facilitating medical documentation [
6] to aiding in diagnostic decision-making [
7] and improving patient interactions [
8]. A recent umbrella review concluded that ChatGPT holds great potential as an educational and clinical support tool in healthcare, but its safe and effective use depends on strong ethical guidelines and regulatory frameworks [
9]. Furthermore, the review underscores the need to address issues like bias, overreliance, and trust through coordinated efforts across the healthcare sector. Thus, experts have spoken out against the unregulated use of such systems [
10]. One reason for this is that the use of ChatGPT-4 can frequently result in so-called “confabulations” or “hallucinations,” where the system generates fabricated references or inaccurate information [
11]. Another reason is that the results are highly dependent on the quality and precision of the user’s input [
7]. Confabulations in LLMs are factually incorrect or fabricated outputs that appear fluent and plausible, making them difficult to detect without expert knowledge. These errors occur because LLMs generate responses based on statistical patterns in training data, without a built-in mechanism to verify factual accuracy [
12].
In general, women and men exhibit different communication behaviors [
13] that extend to their use of information and communication technology (ICT) [
14]. Such differences in ICT use may imply variations in quality outcomes, particularly in specialized fields like medicine, while using generative AI systems. Furthermore, gender differences exist in language use in the context of computer-mediated communication [
15,
16]. Men seem to be more inclined to communicate in direct language, with a primary focus on the content of the message. In contrast, women seem to focus on facilitating interpersonal connections and social interactions [
15,
16]. Recent studies have highlighted gender-based differences in the interaction with LLMs in terms of usage frequency, perceived utility, and communication style. One study found that male university students tend to use ChatGPT for longer sessions, whereas female students use it more frequently but for shorter periods [
17]. Another study reported that men predominantly utilize LLMs for analytical or technical tasks, while women more often rely on them for academic writing and theoretical explanations [
18]. Additionally, two studies observed that users often perceive ChatGPT as having a gender, influencing both their language and expectations [
19,
20]. ChatGPT is more likely to be seen as male when used for factual or problem-solving tasks and as female when providing empathetic responses [
19]. Moreover, language analyses reveal subtle gender biases in both user prompts and LLM outputs [
21]. It is therefore important to examine how these differences affect the quality of decision-making in medical practice. Understanding them will help optimize LLM tools and enhance clinical decision-making processes. Further research in this area could provide valuable insights into improving healthcare delivery and patient outcomes.
One domain that could particularly benefit from the focused application of generative AI is occupational health. If its potential is harnessed properly, occupational health experts could improve risk assessment, plan workplace safety measures more strategically, and manage work-related health problems more effectively. Incorporating generative AI could simplify data analysis, help identify workplace hazards early on, and support more tailored intervention plans, all of which can lead to better health outcomes for workers.
In particular, in Germany, the field of occupational medicine is characterized by significant complexity due to regulations such as the Ordinance on Occupational Diseases (BKV—Berufskrankheiten-Verordnung) [
22] and various legal frameworks. This complexity often creates uncertainty among non-specialist physicians when classifying occupational diseases, as traditional research methods may not provide sufficient or clear information.
This study explores gender-specific differences in the use of ChatGPT by medical students and physicians while solving occupational lung disease cases. As large language models are increasingly integrated into clinical education, it is essential to understand how user characteristics such as gender, expertise, and communication style could affect interactions with ChatGPT. Little is known about how gender influences these interactions. By analyzing real user inputs and the resulting outcomes, this study aims to provide initial insights into gender-related differences in communication and performance. In this study, a subgroup analysis of a superordinate study [
23] was conducted, in which three cases from the field of occupational medicine were designed and presented to a group of students and physicians. The participants were assigned to solve the cases using ChatGPT. In addition, the entries made by participants were recorded and analyzed for further comparison. A comparative analysis was conducted to evaluate potential differences in the use of generative LLMs between male and female participants, with a focus on variations in the quality and accuracy of output generated.
2. Materials and Methods
2.1. Study Design
The current study is a subgroup analysis of a recent study comparing ChatGPT-4 and common internet research conducted by physicians and medical students [
23] (
Figure 1).
Using a virtual coin toss, participants were randomly assigned to one of two groups: (1) research via ChatGPT-4 or (2) research with internet tools routinely used by the participants, e.g., Google or UpToDate. Demographic data were collected, including self-reported gender, with answer options of female, male, or diverse, in accordance with the official gender entries on German passports. After participants had self-assessed their occupational medical knowledge, three cases involving occupational lung diseases were presented, with questions to be answered using the assigned research method. After case processing, satisfaction with the research method and a repeated self-assessment of occupational medical knowledge were recorded.
2.2. Participants
The recruitment and study processes of the superordinate project are described in detail elsewhere [
23]. In brief, participants were recruited through notice board announcements and personal contacts. The online study, conducted in German, required participants to be either enrolled in a medical degree program or actively practicing as a physician. Exclusion criteria were being under 18 years of age or having no medical background.
For this subgroup analysis, only participants who were assigned to the ChatGPT group were included. No participant indicated a gender other than female or male. Since this was a subgroup analysis, no matching or pairing was conducted; the two groups were formed post hoc according to self-reported gender.
2.3. Research Settings
The English translation of the German survey is available in
Table S1 and in [
23]. The survey presented three cases, each based on real occupational medicine patients and modified for the study, with six questions per case. The cases were as follows:
Case 1: An outdoor worker with incidental findings of asbestos-related pleural changes on thoracic CT, leading to a suspected occupational disease report.
Case 2: A young woman working in galvanization who developed a metal sulfate allergy.
Case 3: A former dental technician with recognized berylliosis.
Each case included six questions in yes/no, multiple-choice, or free-text formats, with a “don’t know” option for each. Questions appeared three at a time alongside the case vignette on the screen. Respondents used either an integrated ChatGPT window (
Figure 2) or their usual research tools, depending on group assignment. Responses could not be altered once submitted, to prevent learning effects from later questions. Copying of questions or vignettes into the chat window was disabled in order to track input accuracy and participants’ own phrasing. The numbers of correct and “don’t know” answers were counted for evaluation. For questions initially answered without research, responses given before and after group assignment were compared.
After each case, participants could choose to continue to another case or proceed to the final questions, ensuring that the concluding questions were addressed. The final questionnaire reassessed participants’ occupational medicine expertise, recorded which research tools were used, and gathered feedback on their research experience, including both positive and negative comments.
2.4. Research Instrument: Application
A web application was developed for the study, integrating a question-answering environment, group assignment, and case view. A chat window with a ChatGPT interface was integrated into the case view to facilitate user-friendly data collection. All data, including chat entries and responses, were stored in an SQL (structured query language) database. The application integrated OpenAI’s GPT-4 model via the Chat Completions API (application programming interface), ensuring user privacy by routing communications through private servers to prevent OpenAI from accessing participants’ IP addresses.
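To make the storage layer concrete, the following is a minimal sketch of how chat entries and responses could be persisted in a relational database; the table layout, column names, and use of SQLite are illustrative assumptions, not the study’s actual schema or database engine.

```python
import sqlite3

# Hypothetical schema illustrating how chat entries and responses could be
# stored; the study's actual database layout is not published here.
conn = sqlite3.connect("study.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chat_messages (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        participant_id INTEGER NOT NULL,
        case_number    INTEGER NOT NULL,
        role           TEXT CHECK (role IN ('user', 'assistant')),
        content        TEXT NOT NULL,
        created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()
```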
Participants’ input, along with previous input and responses in the ongoing chat, was used as context for the LLM. The model’s responses were streamed in sections so that participants could start reading before the full answer had been generated. The most current model at the time, GPT-4-0125-preview, was used, with a system prompt instructing it to act as an assistant for medical students and physicians investigating occupational diseases in Germany. The prompt instructed the model to ensure the accuracy of the information, to ask clarifying questions in case of ambiguity, and to provide concise answers. Default values for prompting GPT-4-Turbo via OpenAI’s Chat Completions API were used to mimic real-world usage of ChatGPT: the default temperature (t = 1.0), top_p = 1.0, and no explicit max_tokens limit; the model itself caps output at 4096 tokens.
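As an illustration of the configuration described above, a minimal sketch of such a Chat Completions call with the OpenAI Python client (v1.x) follows. The function name, the exact system prompt wording, and the message handling are assumptions for illustration only, not the study’s actual implementation, which additionally routed traffic through private servers.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrased system prompt; the study's exact wording is not reproduced here.
SYSTEM_PROMPT = (
    "You are an assistant for medical students and physicians investigating "
    "occupational diseases in Germany. Ensure the accuracy of your information, "
    "ask clarifying questions in case of ambiguity, and answer concisely."
)

def ask_chatgpt(history: list[dict], user_input: str) -> str:
    """Send the ongoing chat plus the new input and stream the reply in sections."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history
    messages.append({"role": "user", "content": user_input})

    stream = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=messages,
        temperature=1.0,   # default values, mimicking routine ChatGPT usage
        top_p=1.0,
        stream=True,       # stream the answer so it can be displayed early
    )

    reply = ""
    for chunk in stream:
        if chunk.choices:
            reply += chunk.choices[0].delta.content or ""

    history += [{"role": "user", "content": user_input},
                {"role": "assistant", "content": reply}]
    return reply
```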
2.5. ChatGPT Input and Output
The study examined the inputs made by participants and the outputs generated by ChatGPT, together with demographic data, self-assessments before and after casework, and satisfaction with ChatGPT, rated using German school grades from 1 (very good) to 6 (unsatisfactory). The numbers of words entered and generated were recorded, along with the nature of participants’ communication, such as whether ChatGPT was directly addressed and whether the input was provided in complete sentences or as keywords. Inputs were also analyzed for the use of adverbs (e.g., please), concretizations, personal pronouns (I, me, my), spelling mistakes, grammatical errors, and message and character counts. Notable aspects of participants’ inputs were documented. For the quantitative analysis, the numbers of correct and “don’t know” answers were counted and compared. The output analysis involved character counts as well as the identification and assessment of confabulations and consecutive errors. To minimize observer bias, the analysis was conducted with participants’ genders blinded; genders were revealed only after the analysis was completed.
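The input coding described above was performed by blinded human raters. Purely as an illustration, the sketch below shows how some of the simple lexical measures (character and word counts, personal pronouns, politeness markers, complete sentences versus keywords) could be computed automatically; the word lists and the sentence heuristic are assumptions, not the study’s coding scheme.

```python
import re

# Illustrative word lists (German/English); the study's actual coding criteria
# were applied manually by blinded raters.
PERSONAL_PRONOUNS = {"i", "me", "my", "ich", "mir", "mich", "mein", "meine"}
POLITENESS_MARKERS = {"please", "bitte"}

def describe_input(text: str) -> dict:
    """Compute simple lexical features of a single participant input."""
    words = re.findall(r"\w+", text.lower())
    return {
        "n_characters": len(text),
        "n_words": len(words),
        "n_personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "n_politeness_markers": sum(w in POLITENESS_MARKERS for w in words),
        # Crude heuristic: inputs ending with sentence punctuation are treated
        # as complete sentences rather than keyword-style queries.
        "complete_sentence": text.strip().endswith((".", "?", "!")),
    }

print(describe_input("Bitte nennen Sie mir die typischen CT-Befunde bei Asbestose."))
```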
2.6. Data Analysis
The data were analyzed using GraphPad Prism version 10.2.3 (GraphPad, La Jolla, CA, USA) and SPSS version 29.0.0.0 (Statistical Package for the Social Sciences, Inc., Chicago, IL, USA). Data are presented as the number (n) with proportion in % (Table 1), as the number of answers or correct answers (Tables 2–4 and 6), or as mean values with standard deviation (Tables 1–6 and 8). Group differences between genders were examined using either the Mann–Whitney U-test or the Chi² test for group sizes of at least 5 individuals; otherwise, the Fisher exact test was applied (Tables 2–6 and 8, and Supplementary Table S3). For selected questions, responses given before and after group assignment were compared using the Wilcoxon test (Tables 2–4). All statistical tests were two-sided, with a significance level of
p < 0.05. Correction for multiple comparisons was conducted via the Benjamini–Hochberg procedure (false discovery rate, FDR). Multiple linear regression was used to test whether age and gender predicted self-assessment after using the ChatGPT application. The normal distribution of residuals was checked using Q-Q plots. Multicollinearity was tested using the variance inflation factor (VIF). For all independent variables, VIF values ranged between 1 and 1.8, indicating no relevant collinearity. The existence of extreme outliers could be excluded using Cook’s distance.
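The analyses were run in GraphPad Prism and SPSS. Solely to illustrate the testing workflow described above (two-sided Mann–Whitney U test, Benjamini–Hochberg FDR correction, and the regression diagnostics), a Python sketch with hypothetical data follows; all numbers are placeholders and do not reproduce the study’s results.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)  # hypothetical placeholder data

# Two-sided Mann–Whitney U test for one gender comparison
female_scores = rng.integers(0, 7, size=14)   # e.g., correct answers per case
male_scores = rng.integers(0, 7, size=13)
u_stat, p_value = stats.mannwhitneyu(female_scores, male_scores, alternative="two-sided")

# Benjamini–Hochberg FDR correction across the family of comparisons
p_values = [p_value, 0.03, 0.20, 0.001]       # placeholder p-values
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# Multiple linear regression: do age and gender predict post-use self-assessment?
df = pd.DataFrame({
    "self_assessment_post": rng.integers(1, 7, size=27),
    "age": rng.integers(22, 60, size=27),
    "female": rng.integers(0, 2, size=27),
})
X = sm.add_constant(df[["age", "female"]])
model = sm.OLS(df["self_assessment_post"], X).fit()

# Collinearity (VIF) and influential observations (Cook's distance)
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
cooks_d = model.get_influence().cooks_distance[0]
print(p_adjusted, model.params, vif, cooks_d.max())
```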
4. Discussion
This gender-stratified subgroup analysis explored the use of ChatGPT-4 by female and male physicians, as well as medical students, to research occupational lung disease cases. Female participants demonstrated higher baseline expertise in occupational medicine, as indicated by self-assessment ratings and their ability to correctly answer knowledge-based questions without AI assistance. Although both genders benefited from ChatGPT, experiencing improvements in understanding and case-solving abilities, female participants reported greater overall satisfaction with the AI system and a more pronounced increase in self-perceived expertise following its use.
Male participants generally used more formal language, but their queries more often contained grammatical or typographical errors. Female participants, in contrast, encountered more frequent incorrect or confabulated responses from ChatGPT. Nevertheless, these inaccuracies did not negatively impact their performance, suggesting that they effectively identified and mitigated the misinformation.
These findings highlight significant gender-based variations in communication with ChatGPT, especially regarding self-perceived expertise. Variations in professional communication across gender and other social determinants of health have been widely documented. Gendered communication patterns in professional settings can, for example, include competence-questioning communication [
16]; in male-dominated fields, assertive communication may be penalized in women [
17]. Beyond gender, factors like language proficiency and cultural alignment also shape communication outcomes; individuals from minority or migrant backgrounds often face higher risks of miscommunication or reduced engagement in professional and healthcare interactions, reinforcing structural inequalities [
18,
19]. Regarding medical professionals, existing research consistently identifies substantial gender differences in physician self-assessment, with women often rating their abilities lower than their male counterparts, despite objectively comparable skills and qualifications [
20,
21,
22,
23,
24,
25]. For instance, in emergency medicine, male residents rated themselves higher than females, but external assessments showed no gender differences [
21]. Similarly, female physicians rated their central venous catheterization skills lower than men, despite equal performance [
22].
Several factors contribute to this phenomenon, including societal expectations that encourage women toward modesty regarding their competencies, potentially leading to systematic underestimation of their professional capabilities [
26]. Moreover, female physicians frequently experience misidentification as non-physicians, which exacerbates self-doubt and negatively influences their self-perception [
27]. Additionally, female physicians may internalize negative feedback more deeply, resulting in sustained adverse effects on their self-confidence, career progression, and overall job satisfaction [
28].
Gender-specific communication patterns emerged in interactions with ChatGPT. Male participants generally used more formal language, incorporating polite expressions such as “please” and personal pronouns. This pattern may reflect a preference for formal or structured communication rather than a perception of the AI as a conversational partner. Such an approach is likely shaped by various social and psychological factors that influence human interactions with AI. Polite language, including terms like “please,” often signifies respect, while the use of personal pronouns suggests engagement [
29].
In the present study, participants were explicitly informed that they were interacting with a research method rather than a chatbot simulating human behavior. It is, therefore, notable that male participants maintained a formal style of interaction, while female participants did not. One possible explanation is that women, having initially rated their occupational medicine expertise higher, may have approached the AI as an equal in terms of knowledge. This perceived competence could have fostered a more confident and direct interaction. In contrast, male participants, who assessed their own expertise as lower, may have viewed themselves more as students seeking answers to relatively straightforward questions, thus adopting a more deferential communication style.
Bandura’s self-efficacy theory suggests that individuals with a stronger sense of perceived competence are more likely to demonstrate initiative and approach tasks as collaborative efforts [
30]. This concept of self-efficacy has been empirically validated across diverse demographic groups regarding gender differences [
31] and other socioeconomic factors like family income and supportive academic environments [
32]. Self-efficacy could also be relevant in the context of AI, with users treating interactions with ChatGPT as collaborative tasks rather than purely transactional exchanges. In this study, greater self-efficacy among female participants may have facilitated confident interactions with ChatGPT, promoting deeper, more reciprocal dialogues reflective of genuine expertise. In contrast, male participants demonstrated a more deferential interaction style, typically prioritizing straightforward answers over exploratory dialogues [
27].
As noted above, male queries contained more grammatical or typographical errors, whereas female participants encountered more frequent incorrect or confabulated responses from ChatGPT. Nevertheless, these inaccuracies did not negatively impact their performance, suggesting an ability to critically assess and filter AI-generated information. This skill may be linked to their more thorough review process. In this study, female participants employed different research strategies and approaches than their male counterparts. Notably, female participants more frequently asked for all possible answers to multiple-choice questions, reviewed the generated output, and independently determined the correct response. In contrast, male participants were more likely to present all five answer choices to ChatGPT and rely on the AI to select the correct option. Additionally, only female participants introduced new ideas beyond the given questions; this exploratory approach, however, increased their exposure to AI-generated confabulations and, potentially, the absorption of false information. These differing approaches suggest that female participants engaged in a more comprehensive review process, potentially contributing to a greater overall increase in knowledge, whereas male participants received only a single response per question, limiting their exposure to additional information. These gender-based patterns align with broader findings on communication differences, where women are generally more detail-focused and cautious in decision-making, while men tend to rely more on direct output [
33,
34]. Such tendencies may be further influenced by other social determinants of health, including education level, professional role, and digital literacy, which collectively shape how users communicate in general and with AI tools [
35].
Notably, the female participants’ ability to fact-check ChatGPT could also stem from their existing occupational medicine expertise. Given their lower baseline competency, male participants might have struggled to identify confabulations, potentially integrating false information into their memory and thereby influencing case-processing outcomes.
These findings have two key implications. First, they highlight how gender-specific problem-solving strategies impact learning and the risk of absorbing misinformation, emphasizing the role of communication and decision-making styles in AI interactions. Second, they underscore the importance of AI literacy in recognizing and mitigating confabulations. Understanding these gender-associated tendencies is essential for designing AI tools that accommodate diverse learning approaches. The fact that increased exposure to confabulations did not lead to more false answers among female participants suggests that their engagement with multiple sources may serve as a cognitive safeguard against misinformation. This insight is valuable for integrating AI into education, where fostering critical evaluation skills can help prevent the spread of inaccuracies across different user groups.
4.1. Limitations
This subgroup analysis of the superordinate study [
14] included 27 participants. However, only 15 participants completed all three cases, resulting in a very small sample size. Due to this limitation, generalizations and extrapolations to other medical specialties or to physicians in general are not advisable. This subgroup analysis was conducted merely to provide initial insights into possible gender-dependent effects. In any case, the results should be interpreted with caution and should be replicated and verified by future studies with larger numbers of cases. Furthermore, particularly for case one, the number of women was higher than that of men. This ratio shifted over the next two cases, reaching an almost equal proportion of women and men for case three; nevertheless, it could have influenced the reported results. A dropout analysis showed no significant differences between participants who dropped out during the study and those who completed it, but this does not completely rule out attrition bias. Outlier analysis revealed no relevant outliers. However, due to the small sample size and the limited recording of possible independent variables influencing outcomes, the results cannot be generalized to other settings. Additionally, the sample consisted predominantly of students from a single university, introducing a potential sampling and selection bias that must be considered in interpreting the findings. Nevertheless, this study represents the first investigation into the application of ChatGPT in occupational medicine.
Most existing research on ChatGPT in medicine primarily analyzes AI-generated responses to specific inquiries. To the best of our knowledge, this study is among the first to examine real user inputs from medical personnel, monitor their interactions, and assess the subsequent AI-generated outputs. Thus, despite the study’s limitations, its findings offer valuable insights, particularly for designing future studies.
Given the small sample size, complex statistical analyses were not feasible. Only significant gender differences were reported, and no comprehensive model assessing the influence of factors such as gender, age, and background could be developed. The intersectionality of these variables, which likely influences outcomes, could not be adequately examined.
Future studies should aim for larger sample sizes to enable more robust analyses. However, recruitment proved challenging, requiring personal outreach and repeated reminders. This highlights two critical issues: participant recruitment is inherently difficult, making larger sample sizes hard to achieve, and study samples are likely to be selective and may not fully represent the target population.
Participants completed all questionnaires independently, without the opportunity to seek clarification on instructions. While their chat interactions were recorded and reviewed, no apparent misunderstandings in case processing were observed. However, the possibility of unrecognized issues or distortions cannot be entirely ruled out. Furthermore, self-reported data on gender and medical background were not independently verified. Transgender participants would most likely have reported their gender identity rather than the sex assigned to them at birth. Nonetheless, as the primary study did not focus on gender-specific analysis and participants were unaware of such an investigation, deliberate misreporting of gender appears unlikely. Furthermore, AI literacy and experience were not characterized, nor was participants’ expertise in prompt formulation. At the beginning of the survey, participants were asked which research methods they frequently used, and none of the participants in this subgroup indicated ChatGPT, suggesting that they did not routinely use it. However, since this was not specifically asked about or evaluated, this influencing factor cannot be conclusively assessed and should be investigated in future studies.
The increasing adoption of generative AI (like ChatGPT) in healthcare introduces substantial risks related to AI-generated confabulations. Such inaccuracies have serious implications for clinical practice, as uncritical reliance on AI-generated information can lead to clinical errors, misdiagnoses, or inappropriate treatments [
36]. Aggregate effects of societal inequities are embedded in AI systems; by replicating historical patterns of exclusion in medical training data, these systems can amplify gender and intersectional biases and risk cementing traditional roles in healthcare.
Hence, in the current state, AI should strictly serve as a supportive tool rather than a primary medical resource.
Furthermore, medical personnel should be intensively trained in the use of AI and the risks of using it in everyday clinical practice. To optimize training, integrating gender-specific insights into AI education is recommended. Leveraging diverse cognitive strengths across genders can enhance decision-making capabilities. Training should emphasize critical evaluation of AI-generated content, cross-referencing with trusted medical resources, and balancing tendencies toward overconfidence (often observed among male physicians) with thorough, cautious decision-making styles (commonly exhibited by female physicians). Employing practical training methods, such as role-playing scenarios and case studies, can facilitate collaborative learning and mutual appreciation of differing problem-solving approaches. Such training initiatives may affect end-user groups differently, as digital literacy, communication preferences, and cognitive styles vary across gender, age, and professional experience. Tailoring programs to these factors can improve engagement, reduce the risk of misuse, and promote equitable and effective adoption of AI in clinical workflows.
4.2. Research Implications
The finding that female participants, despite more exposure to confabulations, demonstrated higher accuracy could be explained by their higher baseline expertise in occupational medicine. This suggests that AI tools like ChatGPT may be especially effective when used by individuals with existing domain knowledge, as has also been demonstrated in recent research [37]. This points to a possible need to tailor AI applications to users’ expertise levels. Furthermore, users without topic-specific knowledge could put clinical decision-making in occupational disease cases at risk. Targeted AI training should, therefore, focus on improving confabulation detection and cross-verification strategies.
Gender disparities in technology use influence public health, particularly in family health management, patient engagement, and health education [
38]. Women, often primary caregivers, frequently use digital tools for health-related tasks, while men may engage less with such platforms, limiting their role in health decision-making [
39,
40]. Physicians also exhibit gendered technology usage. Female physicians more frequently integrate patient-centered digital tools, enhancing communication and accessibility [
41]. AI-driven health education programs that fail to account for gender preferences may struggle to engage diverse populations effectively [
42].
Beyond individual user characteristics, the accessibility of LLM AI tools such as ChatGPT must also be critically examined in the context of social determinants of health. Factors such as digital literacy, access to reliable internet, language proficiency, and comfort with technology can significantly influence AI-supported learning environments [
35]. These structural determinants may reinforce existing inequities if not adequately addressed. Ensuring equitable access to LLMs will be essential for harnessing their full potential in medical education and public health contexts.
In conclusion, this study highlights how gender differences, possibly linked to different communication styles, could influence the use and impact of AI tools like ChatGPT in medical education and practice. In this study, female participants, who reported higher initial expertise, engaged more collaboratively and benefited more from AI, while male participants adopted a more formal, task-oriented approach. These findings, if replicated and confirmed in other studies, could suggest the need for gender-sensitive—or at least communication style–targeted—AI training to enhance learning outcomes and promote critical evaluation skills.
Future research should examine how problem-solving strategies and self-efficacy affect AI effectiveness in clinical education. Expanding sample sizes and incorporating diverse demographics will provide deeper insights into how gender intersects with other factors in AI interactions. Longitudinal studies are needed to assess AI’s long-term impact on confidence, knowledge retention, and clinical performance.
Additionally, AI literacy training should emphasize critical thinking to mitigate risks from misinformation. Developing guidelines for AI integration in medical education and patient care will be essential. By acknowledging gender-based differences, medical education can optimize AI use, improve healthcare outcomes, and foster equitable, inclusive training for physicians.