Article

Generative Artificial Intelligence and Risk Appetite in Medical Decisions in Rheumatoid Arthritis

Faculty of Medicine, Carol Davila University of Medicine and Pharmacy Bucharest, 050474 București, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5700; https://doi.org/10.3390/app15105700
Submission received: 16 April 2025 / Revised: 14 May 2025 / Accepted: 16 May 2025 / Published: 20 May 2025
(This article belongs to the Special Issue Machine Learning in Biomedical Sciences)

Abstract

With Generative AI (GenAI) entering medicine, understanding its decision-making under uncertainty is important. It is well known that human subjective risk appetite influences medical decisions. This study investigated whether the risk appetite of GenAI can be evaluated and if established human risk assessment tools are applicable for this purpose in a medical context. Five GenAI systems (ChatGPT 4.5, Gemini 2.0, Qwen 2.5 MAX, DeepSeek-V3, and Perplexity) were evaluated using Rheumatoid Arthritis (RA) clinical scenarios. We employed two methods adapted from human risk assessment: the General Risk Propensity Scale (GRiPS) and the Time Trade-Off (TTO) technique. Queries involving RA cases with varying prognoses and hypothetical treatment choices were posed repeatedly to assess risk profiles and response consistency. All GenAIs consistently identified the same RA cases for the best and worst prognoses. However, the two risk assessment methodologies yielded varied results. The adapted GRiPS showed significant differences in general risk propensity among GenAIs (ChatGPT being the least risk-averse and Qwen/DeepSeek the most), though these differences diminished in specific prognostic contexts. Conversely, the TTO method indicated a strong general risk aversion (unwillingness to trade lifespan for pain relief) across systems yet revealed Perplexity as significantly more risk-tolerant than Gemini. The variability in risk profiles obtained using the GRiPS versus the TTO for the same AI systems raises questions about tool applicability. This discrepancy suggests that these human-centric instruments may not adequately or consistently capture the nuances of risk processing in Artificial Intelligence. The findings imply that current tools might be insufficient, highlighting the need for methodologies specifically tailored for evaluating AI decision-making under medical uncertainty.

1. Introduction

Although the science of medicine has advanced substantially in recent years, medical decisions are still considered to rely significantly on the specific skills of the physician (apart from the scientific fundamentals). In this context, risk management is an integral component of medical practice—frequently, physicians must make decisions based on incomplete information, facing possibilities of clinical evolution that cannot be predicted with certainty. Medical decisions (and ultimately patient outcomes) are influenced by how the physician positions themselves regarding unknown information and the level of risk they are willing to accept in their choices. This defines their risk appetite or aversion, recognized as factors influencing how medical decisions are made. These factors depend on both intrinsic elements (the physician’s personality, previous experiences, etc.) and extrinsic elements (the actual and perceived quality of the medical microenvironment in which the physician operates) [1,2]. The assumption of risk by a physician is based on a complex analysis (potentially involving subconscious elements) evaluating potential gains from a correct decision compared to the costs associated with an incorrect one. As fear of malpractice has increased recently, physicians tend towards risk aversion, which may be exemplified by broader diagnostic and therapeutic schemes or the avoidance of definitive treatment decisions [3,4].
The rapid development of Artificial Intelligence-based information systems has enabled their entry into the medical field. Generative Artificial Intelligence (GenAI) encompasses a range of algorithms that learn from data to create new, original content, including text, images, or synthetic data, rather than only performing discriminative tasks like classification. These models, such as the Large Language Models (LLMs) evaluated in this study, identify patterns and structures in their training data to generate novel outputs. Other notable types of generative algorithms include Generative Adversarial Networks (GANs), which employ a dual-network architecture for tasks like realistic image generation. The application of various GenAI techniques, including GANs, has seen a significant surge in clinical research over recent years, offering potential in areas like medical image synthesis, data augmentation, and drug discovery. While previous studies have focused on the accuracy and factual reliability of GenAI outputs in clinical settings, to the best of our knowledge, no prior research has systematically investigated how these systems handle uncertainty and risk in therapeutic decision-making. This study addresses that gap by applying validated human risk assessment tools to evaluate and compare the risk-related behavioral profiles of multiple GenAI models. Our study focuses specifically on the decision-making characteristics of prominent conversational GenAI systems when faced with medical uncertainty. GenAI systems are widely used, offering the general public accessible, conversational interaction on a wide range of subjects. Among the commonly used GenAI models are ChatGPT, Gemini, DeepSeek, and Qwen, which are accessible globally. Their development methods and the datasets they were trained on differ, as do their inherent biases [5,6,7]. Nevertheless, both physicians and patients utilize them to obtain or analyze medical information. However, there is no precise method for estimating the level of bias in interactions with these GenAI tools, although a threshold of 15%, comparable to human error rates in the field, is sometimes considered [8,9,10].
The use of GenAI inputs in medical decision-making (either directly by the physician or indirectly through patients informing themselves using GenAI) should be evaluated considering both the quality of the information held by the AI system and how the GenAI processes this information and selects its response format. This response, however, is based on the same incomplete medical information available to the physician. If the physician must formulate their response considering risk appetite or aversion, it is plausible that GenAI systems employ a similar approach.
The primary objective of this study was to evaluate and characterize the risk appetite manifested by Generative Artificial Intelligence systems in the context of complex medical decisions. The research aimed to identify whether consistent patterns exist in the risk approach of GenAI systems when confronted with clinical situations involving different degrees of uncertainty and varying risk–benefit profiles. Specifically, the study aimed to determine whether GenAI systems demonstrate consistency in risk appetite or if it varies depending on the presented clinical context, evaluate whether differences exist in the recommendations offered for cases with favorable versus unfavorable prognoses (thus reflecting a potential adaptation of risk appetite according to case severity), and identify potential trends or biases in how GenAI systems address situations involving uncertainty and risk in a medical context. The results of this study may offer insights into the reliability and characteristics of recommendations provided by GenAI systems in the medical domain, contributing to a more nuanced understanding of how these technologies can be responsibly integrated into the clinical decision-making process.

2. Materials and Methods

In this study, we evaluated the concepts of risk appetite and risk aversion in the context of medical decisions, using five contemporary Generative Artificial Intelligence (GenAI) systems as sources of decisional perspective. We selected the following systems for analysis:
  • ChatGPT 4.5—Developed by OpenAI, this model represents an advanced iteration of the GPT (Generative Pre-trained Transformer) architecture. It is trained on a large corpus of text and multimodal data, being one of the most widely used GenAI systems globally. The model uses supervised learning and reinforcement learning from human feedback (RLHF) techniques [11].
  • Gemini 2.0—Developed by Google, this next-generation multimodal model is designed to process and generate diverse content types (text, image, and audio). Gemini utilizes advanced transformer architecture and is trained on diverse data, including specialized medical research [12].
  • Qwen 2.5 MAX—Developed by Alibaba Cloud, this model represents an advanced version of the Qwen architecture. It is a multilingual model with advanced natural language processing and complex task-solving capabilities, including in the medical domain [13].
  • DeepSeek-V3—Developed by DeepSeek AI, this model uses an advanced transformer-type architecture and is trained on a large data corpus to provide detailed and contextualized responses. It is noted for its ability to address complex problems in specialized domains [14].
  • Perplexity—Built on a large language model, this system is optimized for information retrieval and factual responses. It is differentiated by its ability to access and synthesize current information from multiple sources [15].
All queries were performed between 1 and 15 April 2025, over four separate days with at least two days between them, using the same computer connected to the internet via the same IP address in the European Union. All GenAI systems were accessed via their publicly available interfaces without any form of fine-tuning, local deployment, or integration with custom knowledge bases; responses were generated based solely on standardized prompts in isolated sessions to ensure consistency and eliminate prior-context influence.
The research was structured into three distinct phases, each employing complementary methodologies to evaluate risk tolerance in the medical context. In the initial phase of the study, we developed 20 clinical cases of Rheumatoid Arthritis (RA) with controlled variation in relevant clinical parameters. These parameters included disease duration, patient age, patient sex, number of conventional disease-modifying antirheumatic drugs (DMARDs) previously tried and failed, and number of biological DMARDs previously tried and failed. Each GenAI system was asked to evaluate the 20 cases and identify the case with the most favorable prognosis (hereafter referred to as “Case A”) and the case with the most unfavorable prognosis (hereafter referred to as “Case B”). It is noteworthy that all five GenAI systems independently identified the same cases as having the best and worst prognosis, respectively, providing a consistent basis for subsequent research stages. In the second phase of the study, we assessed risk appetite using an adapted version of the GRiPS (General Risk Propensity Scale), a validated instrument for evaluating general risk propensity [16,17]. The standard statements from the GRiPS questionnaire were specifically adapted to represent therapeutic statements applicable to a generic patient with Rheumatoid Arthritis (see Table 1).
To ensure data consistency and reliability, each GenAI system was queried twelve times in total, in random order: four queries without specifying a medical or non-medical decision context, four queries explicitly related to the decision to initiate a new treatment in the patient with the best prognosis (Case A), and four queries explicitly related to the decision to initiate a new treatment in the patient with the worst prognosis (Case B). Each question was formulated as follows: “The person described below has Rheumatoid Arthritis. The person is currently receiving a treatment that ensures complete remission for 10 years without any risk of adverse reactions. The person has the option of choosing a new treatment that could induce complete remission for the entire life, but it carries a 2% annual risk of thromboembolic events. The person is uncertain whether to remain on the current treatment or to take the risk with the new treatment. Rate the statements below from 1 to 5, giving a rating of 1 if, when discussing which option the person should choose, you would never say this to her, and a rating of 5 if you would always, without doubt, say this to this person”. In the third phase of the study, we began by assessing knowledge about pain and the significance of a moderate pain threshold. Interspersed with other discussions over one day, we asked: “How important is a pain level of 4 on the VAS scale?”. Each GenAI was asked three times, and the correct answer—moderate pain intensity—was obtained in 100% of cases. Then, using a TTO (Time Trade-Off) methodology [18,19,20], adjusted for life expectancy in Rheumatoid Arthritis, we asked each GenAI to respond eight times to the question: “The current treatment in Rheumatoid Arthritis offers the patient a 20% probability of remaining with pain of at least level 4 on the VAS pain scale and does not shorten their life. The following alternative treatments offer the following: 20 years of perfect health, without pain, followed by immediate death or 19 years of perfect health, without pain, followed by immediate death (and so on) … 1 year of perfect health, without pain, followed by immediate death. What is the minimum acceptable alternative for a patient aged X years: the current treatment (denoted as 0) or which of the alternative treatments?”. We presented the question in three versions: for a patient aged 20, 40, and 55 years, respectively, resulting in 24 questions for each GenAI.
To mitigate potential memory effects and ensure the independence of responses when querying the GenAI systems multiple times, several precautions were implemented. All queries were distributed over four separate days with at least a two-day interval between sessions. For the adapted GRiPS questionnaire, the 12 queries for each GenAI system were presented in a randomized order. For the TTO assessments, each of the 24 questions per GenAI was posed at the beginning of a new, separate interaction session, and these questions were also presented in a random sequence. These measures were designed to minimize the influence of prior interactions on subsequent responses.
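As an illustration of this protocol, the sketch below (in Python; the query identifiers, seed, and scheduling helper are our own illustrative constructs, not the study's actual tooling) builds the randomized per-system schedule of 12 adapted-GRiPS queries and 24 TTO queries described above.

```python
# Illustrative sketch of the randomized query schedule (not the study's
# actual tooling): 12 adapted-GRiPS queries (4 x scenarios P, A, B) and
# 24 TTO queries (8 repetitions x ages 20, 40, 55) per GenAI system.
import random

SYSTEMS = ["ChatGPT 4.5", "Gemini 2.0", "Qwen 2.5 MAX", "DeepSeek-V3", "Perplexity"]

def build_schedule(seed: int = 0) -> list[tuple[str, str]]:
    """Return (system, query_id) pairs, shuffled within each system."""
    rng = random.Random(seed)
    schedule = []
    for system in SYSTEMS:
        grips = [f"GRiPS-{scenario}-{i + 1}" for scenario in "PAB" for i in range(4)]
        tto = [f"TTO-age{age}-{i + 1}" for age in (20, 40, 55) for i in range(8)]
        queries = grips + tto
        rng.shuffle(queries)  # randomize the order of all queries per system
        schedule.extend((system, q) for q in queries)
    return schedule

# Each scheduled query would be issued in a fresh, isolated chat session,
# spread over the four collection days.
for system, query in build_schedule()[:3]:
    print(system, query)
```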
For data analysis (executed with IBM SPSS Statistics v26), we evaluated the internal consistency of each GenAI system’s recommendations (by comparing responses to repeated questions), differences in risk appetite between cases with different ages, and variability in the acceptability thresholds for the duration–quality trade-off in the TTO methodology. This analysis allowed us to identify relevant patterns and trends in the approach to medical decisions by Generative Artificial Intelligence systems, with potential implications for understanding decision-making processes in the medical context.
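Although the analysis was executed in SPSS, the same tests are available in open-source libraries. The sketch below (Python with NumPy/SciPy; all score values are illustrative placeholders rather than the study's data) mirrors the pipeline: coefficients of variation for intra-rater consistency, Shapiro–Wilk normality checks, a Kruskal–Wallis comparison across the five evaluators with post hoc Mann–Whitney U tests, and a Friedman test on repeated TTO ratings.

```python
# Sketch of the statistical pipeline (run in SPSS in the study) using
# SciPy equivalents. The arrays are illustrative placeholders: 4 repeated
# GRiPS totals per evaluator for one scenario, and 8 TTO utilities per
# evaluator for one age group.
import numpy as np
from scipy import stats

grips_p = {  # scenario P (general, nonmedical): 4 repeated totals each
    "ChatGPT":    [18, 20, 17, 19],
    "Gemini":     [22, 24, 23, 21],
    "Qwen":       [27, 26, 28, 27],
    "DeepSeek":   [26, 28, 27, 26],
    "Perplexity": [25, 26, 25, 26],
}

# Intra-rater consistency: coefficient of variation, CV = SD / mean * 100
for name, scores in grips_p.items():
    cv = np.std(scores, ddof=1) / np.mean(scores) * 100
    print(f"{name}: CV = {cv:.2f}%")

# Normality check per evaluator (Shapiro-Wilk)
for name, scores in grips_p.items():
    w, p = stats.shapiro(scores)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Kruskal-Wallis: do total scores differ across the five evaluators?
h, p = stats.kruskal(*grips_p.values())
print(f"Kruskal-Wallis H = {h:.3f}, p = {p:.3f}")

# Post hoc pairwise Mann-Whitney U tests (as used for Scenario P)
names = list(grips_p)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        u, p = stats.mannwhitneyu(grips_p[names[i]], grips_p[names[j]])
        print(f"{names[i]} vs {names[j]}: p = {p:.3f}")

# Friedman test on TTO utilities for one age group (columns = evaluators,
# rows = repeated presentations); values again purely illustrative.
tto_55 = np.array([
    [1.0, 1.0, 1.0, 1.0, 0.6],
    [1.0, 1.0, 0.9, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.5],
    [0.9, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.9, 0.4],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.95, 1.0, 0.8],
    [1.0, 1.0, 1.0, 1.0, 1.0],
])
chi2, p = stats.friedmanchisquare(*tto_55.T)
print(f"Friedman chi-square = {chi2:.3f}, p = {p:.3f}")
```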

3. Results

The analysis of data collected across the three study phases aimed to evaluate the reliability of GenAI responses and characterize risk appetite in various medical decision-making contexts.

3.1. Reliability and Consistency of Measurements (Phase 2—Adapted GRiPS Questionnaire)

Normality tests (Kolmogorov–Smirnov and Shapiro–Wilk) applied to the total scores obtained from the adapted GRiPS questionnaire indicated that data for ChatGPT and Qwen approached a normal distribution, while data for DeepSeek, Gemini, and Perplexity did not. This finding justified the use of non-parametric statistical methods for subsequent comparisons between evaluators. To assess the consistency of each GenAI’s responses to repeated questions in the second phase, we calculated the coefficient of variation (CV) for total scores in each scenario (P—General; A—Good Prognosis; B—Poor Prognosis). Table 2 shows that consistency was generally good to acceptable (CV < 30%). A notable exception was Gemini in scenario A (Good Prognosis), where the CV was 39.15%, indicating higher variability in responses in this specific context. The other evaluators exhibited lower CVs, ranging from 3.51% (Perplexity, Scenario P) to 25.90% (DeepSeek, Scenario A), suggesting good intra-rater reliability under most conditions.

3.2. Differences Between Evaluators

The Kruskal–Wallis test was used to compare the distribution of total scores among the five GenAI evaluators, separately for each scenario (A, B, and P); the results are presented in Table 3. In both scenario A (Good Prognosis) and scenario B (Poor Prognosis), no statistically significant differences were found between evaluators (see Figure 1).
This suggests that the GenAIs had a relatively similar approach to risk when evaluating these cases. In scenario P (general nonmedical), statistically significant differences were found between evaluators. This result indicates that, in the absence of a specific clinical context (Good Prognosis or Poor Prognosis), the GenAIs exhibit significant differences in their general risk propensity, as measured by the adapted GRiPS questionnaire.
To identify which evaluators differed significantly in Scenario P, post hoc Mann–Whitney U tests were performed. The results show that ChatGPT had significantly lower scores (potentially indicating lower risk aversion in this context) compared to each of the other four evaluators (p = 0.020 vs. Perplexity; p = 0.041 vs. Gemini; p = 0.017 vs. Qwen; p = 0.017 vs. DeepSeek), while Gemini had significantly lower scores than Qwen (p = 0.017) and DeepSeek (p = 0.017). There were no significant differences between Qwen, DeepSeek, and Perplexity relative to each other, but they tended to have higher scores (higher risk aversion) compared to ChatGPT and Gemini. These results suggest a potential ranking of evaluators based on risk aversion in the general scenario: ChatGPT < Gemini < (Perplexity, Qwen, and DeepSeek).

3.3. Risk Appetite Assessed via Time Trade-Off (TTO) Method (Phase 3)

Analyzing the assigned utility scores (likely a normalization of accepted years, where 1.00 represents preference for the current treatment or acceptance of 20 years of perfect health), all evaluators had means close to 1.00 (ranging from 0.8563 for Perplexity to 0.9625 for Gemini) and medians of 1.0000. This indicates a strong general tendency among GenAIs not to trade off years of life to avoid moderate pain, preferring the current situation or the alternative only if it offers the maximum duration of perfect health. However, Perplexity is notable for the lowest mean and the highest standard deviation (0.22904), suggesting greater variability and a slightly greater tendency to accept the trade-off compared to the other GenAIs. Skewness and kurtosis values confirm the previously observed non-normal distributions.
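Read this way, the scoring would correspond to the mapping sketched below (our assumption about the normalization; the function and example values are illustrative, not taken from the study):

```python
MAX_YEARS = 20  # longest pain-free alternative offered in the TTO question

def tto_utility(answer: int, max_years: int = MAX_YEARS) -> float:
    """Assumed TTO scoring: an answer of 0 (keep the current treatment)
    counts as refusing any trade-off (utility 1.00); otherwise the minimum
    accepted number of pain-free years y is normalized to y / 20."""
    return 1.0 if answer == 0 else answer / max_years

print(tto_utility(0), tto_utility(20), tto_utility(17))  # 1.0, 1.0, 0.85
```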
Comparing mean TTO utility scores between pairs of evaluators, the only statistically significant difference (p < 0.05) was between Gemini and Perplexity (p = 0.045). Gemini had a significantly higher mean utility score than Perplexity, confirming that Gemini is more risk-averse (less willing to sacrifice years of life) in the TTO context compared to Perplexity. Other comparisons did not show significant differences, suggesting general similarity in the TTO approach among ChatGPT, Qwen, DeepSeek, and, to some extent, Gemini and Perplexity apart from their direct comparison.
The Friedman test was used to compare GenAI preferences (expressed as the minimum number of years of perfect health accepted in exchange for the current treatment) for each patient age group (20, 40, and 55 years). For the first and second scenarios (subject aged 20 and 40 years, respectively), no significant differences were found between evaluators (Table 4). In the third scenario (age = 55 years), analysis of the mean ranks suggests a potential trend: Perplexity had the highest mean rank (4.31), indicating potentially greater willingness to accept fewer years of perfect health (higher risk tolerance in this TTO context), while Gemini had the lowest mean rank (2.25), suggesting the opposite (see Figure 2).

4. Discussion

This study explored the risk appetite of five Generative Artificial Intelligence (GenAI) systems within the context of medical decisions, utilizing two distinct methodologies: an adapted questionnaire (GRiPS) and the Time Trade-Off (TTO) method. The results offer a nuanced perspective on how these technologies approach uncertainty and risk.
A key initial finding is the convergence of GenAIs in the study’s first phase: all systems independently identified the same Rheumatoid Arthritis cases as having the most favorable (Case A) and unfavorable (Case B) prognosis among the 20 presented. This suggests a shared fundamental understanding of prognostic factors in this pathology, likely derived from the extensive medical datasets on which they were trained. However, beyond this basic prognostic assessment, significant divergences emerged regarding risk appetite. Analysis using the adapted GRiPS questionnaire showed that while GenAIs do not differ significantly in their risk approach when presented with a specific clinical context (Good or Poor Prognosis), they exhibit significant differences in general risk propensity (Scenario P). In this general scenario, ChatGPT proved significantly less risk-averse than all other tested GenAIs, whereas Qwen and DeepSeek showed the opposite tendency. This raises the question of whether the specificity of the clinical context anchors the GenAIs’ responses, reducing the variability observed in a more abstract scenario.
Comparing the results obtained via the two methodologies (adapted GRiPS and TTO) reveals that the risk assessment method influences the apparent risk profile of the GenAIs. Whereas in the GRiPS test (general scenario), ChatGPT emerged as the least risk-averse, in the TTO test, Perplexity demonstrated significantly higher risk tolerance compared to Gemini, being more willing to accept a smaller number of years in perfect health in exchange for eliminating the risk of moderate pain. Perplexity also exhibited the greatest variability in TTO responses. This discrepancy between methodologies suggests that GenAIs may process and respond differently to different risk-framing methods—either as a general propensity (GRiPS) or as a direct trade-off between quantifiable outcomes (TTO).
Despite these differences, the TTO method revealed a strong general tendency towards risk aversion among all tested GenAIs. Mean and median utility scores very close to 1.00 indicate that, when faced with a trade-off between lifespan and quality of life (avoiding moderate pain), the GenAIs prefer, in most cases, not to sacrifice years of life unless the alternative guarantees perfect health for the maximum duration. This could reflect implicit programming or learned behavior to prioritize life preservation or avoid recommendations involving quantifiable losses, possibly as a safety measure. The Friedman test suggested potential differentiation in the TTO approach based on patient age, particularly for the 55-year-old group (p = 0.050), where Perplexity showed the highest risk tolerance (mean rank 4.31) and Gemini the lowest (mean rank 2.25). Although not strongly statistically significant, this aligns with the significant difference observed globally between Perplexity and Gemini in paired tests.
The study also investigated the reliability of GenAI responses. Coefficients of variation generally indicated good intra-rater consistency, with the notable exception of Gemini in the good prognosis scenario (A). This specific inconsistency may warrant further investigation—is it possible that Gemini has difficulty calibrating its responses in situations perceived as already having a favorable outcome? Normality tests also confirmed that response distributions vary among GenAIs, necessitating appropriate statistical approaches.

Implications and Limitations

The results highlight the importance of understanding that different GenAI systems may have distinct risk profiles, and these can vary depending on the clinical context and how risk-related questions are framed. This has significant implications for integrating GenAI into the medical decision-making process, either directly by clinicians or indirectly by patients using them for information. A uniform risk approach cannot be assumed across different AI platforms.
This study, while offering valuable contributions to understanding how GenAI approaches risk in a medical context, has inherent limitations that require careful consideration when interpreting and generalizing its results. These limitations are important for defining the scope of the conclusions and guiding future research. A significant limitation is the restricted sample of evaluated GenAI systems. The study focused on five specific platforms (ChatGPT 4.5, Gemini 2.0, Qwen 2.5 MAX, DeepSeek-V3, and Perplexity), which, while popular and advanced at the time of research, do not represent the entire GenAI landscape. The field of Artificial Intelligence is highly dynamic, with new models and versions constantly emerging, each potentially having different architectures, training datasets, and design philosophies, including regarding safety and uncertainty management. Therefore, the risk profiles identified in these five GenAIs cannot be automatically extrapolated to other existing or future systems.
Secondly, the exclusive focus on a single pathology, Rheumatoid Arthritis (RA), limits the generalizability of the findings. RA, although a suitable model for complex decision-making in chronic disease management, has specific characteristics related to prognosis, therapeutic options, treatment-associated risks, and impact on quality of life. Decision-making processes and risk tolerance can vary substantially in other medical fields, such as oncology (where trade-offs between toxicity and survival are common), acute infectious diseases (where timeliness of decisions is critical), or palliative care. The risk appetite manifested by GenAIs in RA scenarios might not be replicated similarly in these different contexts.
A third important limitation is the time-bound nature of the results. Data were collected within a specific timeframe, between 1 and 15 April 2025. GenAI models undergo continuous training, updating, and fine-tuning processes, including techniques like RLHF (Reinforcement Learning from Human Feedback). These updates can alter not only the AI’s knowledge base but also its decision-making behavior, including how it weighs risks and benefits. Thus, the risk profiles observed in the study represent a snapshot in time and may change as the models evolve.
A fourth limitation relates to how the prompts were formulated. Future work should investigate how variations in prompt phrasing may alter AI risk-taking profiles, potentially affecting the generalizability of our findings.
Finally, the methodology based on text queries necessarily simplifies the complexity of real-world clinical interactions. Medical decisions are not based solely on written information exchange but involve dialog, clarifications, interpretation of non-verbal cues (in physician–patient interaction), integration of data from multiple sources (history, clinical examination, and investigations), and nuanced clinical judgment. Simulating this complex process via text prompts, while necessary for study standardization, cannot fully capture how a GenAI might interact or manifest its “risk appetite” within an integrated clinical workflow or a dynamic conversation with a clinician or patient. Responses to predefined prompts may not fully reflect the AI’s ability to navigate uncertainty in less structured situations.
Acknowledging these limitations is important. They do not invalidate the study’s findings but highlight the need for broader future research, including more GenAI systems, a wider range of clinical scenarios, longitudinal assessments to capture model evolution, and potentially, methodologies that more closely simulate actual clinical interactions. This study aimed to assess whether GenAI systems exhibit a higher or lower propensity for making medical decisions involving risk. Clinicians must understand that identical input information provided to different GenAI systems may lead to divergent recommendations, particularly when risk trade-offs are involved. Our findings suggest a certain predisposition (which requires further confirmation through repeated testing) for riskier decision-making in some GenAI systems.

5. Conclusions

Our findings indicate that while Generative AI systems can consistently evaluate basic prognostic factors in Rheumatoid Arthritis, their approach to risk under uncertainty is not uniform, varying significantly between different AI models. The divergence in risk profiles observed when applying the adapted GRiPS versus the TTO methodology suggests that the assessment of risk appetite in GenAI is method-dependent and may not be fully captured by adapting existing human-centric instruments. Specifically, GenAI systems displayed different risk propensities in abstract scenarios compared to when presented with concrete patient prognoses, highlighting the influence of context on their decision outputs. The strong aversion to trading lifespan for quality-of-life improvements (avoiding moderate pain) seen in the TTO results across most GenAIs might reflect underlying safety mechanisms or learned priorities, although the differing tolerance levels (e.g., Perplexity vs. Gemini) show this is not absolute. Ultimately, the demonstrated heterogeneity and the methodological dependencies in assessing GenAI risk appetite underscore the need for developing tailored evaluation frameworks before these technologies can be reliably integrated into complex medical decision-making processes. Future research should extend the analysis to other GenAI systems, include a broader range of medical conditions and decision-making scenarios, and explore the impact of different prompt types. Comparing the risk appetite of GenAI with that of human clinicians in the same scenarios would also provide valuable insights. Understanding the factors driving risk appetite differences among GenAI systems (architecture, training data, and RLHF) is essential for developing more reliable and transparent AI systems for the medical domain.

Author Contributions

Conceptualization, F.B.; methodology, F.B. and E.C.B.; project administration, F.B.; software, F.B. and D.A.; validation, F.B., D.A. and E.C.B.; writing—original draft, F.B.; writing—review and editing, E.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

The authors thank the Carol Davila University of Medicine and Pharmacy Bucharest for its continuous technical and scientific support. During the preparation of this manuscript/study, the authors used the following tools for data generation, hypothesis testing, and commentary generation after statistical evaluations: ChatGPT 4.5—developed by OpenAI; Gemini 2.0—developed by Google; Qwen 2.5 MAX—developed by Alibaba Cloud; DeepSeek-V3—developed by DeepSeek AI; and Perplexity. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
CV: Coefficient of variation
DMARDs: Disease-modifying antirheumatic drugs
GenAI: Generative Artificial Intelligence
GPT: Generative Pre-trained Transformer
GRiPS: General Risk Propensity Scale
K-S: Kolmogorov–Smirnov
NLP: Natural language processing
RA: Rheumatoid Arthritis
RLHF: Reinforcement Learning from Human Feedback
S-W: Shapiro–Wilk
TTO: Time Trade-Off
VAS: Visual Analog Scale

References

1. Tubbs, E.P.; Elrod, J.A.B.; Flum, D.R. Risk taking and tolerance of uncertainty: Implications for surgeons. J. Surg. Res. 2006, 131, 1–6.
2. Contessa, J.; Suarez, L.; Kyriakides, T.; Nadzam, G. The influence of surgeon personality factors on risk tolerance: A pilot study. J. Surg. Educ. 2013, 70, 806–812.
3. Strobel, C.J.; Oldenburg, D.; Steinhäuser, J. Factors influencing defensive medicine-based decision-making in primary care: A scoping review. J. Eval. Clin. Pract. 2023, 29, 529–538.
4. Garcia-Retamero, R.; Galesic, M. On defensive decision making: How doctors make decisions for their patients. Health Expect. 2014, 17, 664–669.
5. Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nat. Med. 2022, 28, 31–38.
6. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
7. Katwaroo, A.R.; Adesh, V.S.; Lowtan, A.; Umakanthan, S. The diagnostic, therapeutic, and ethical impact of artificial intelligence in modern medicine. Postgrad. Med. J. 2024, 100, 289–296.
8. Rashidi, H.H.; Pantanowitz, J.; Chamanzar, A.; Fennell, B.; Wang, Y.; Gullapalli, R.R.; Tafti, A.; Deebajah, M.; Albahra, S.; Glassy, E.; et al. Generative Artificial Intelligence in Pathology and Medicine: A Deeper Dive. Mod. Pathol. 2024, 38, 100687.
9. Bhuyan, S.S.; Sateesh, V.; Mukul, N.; Galvankar, A.; Mahmood, A.; Nauman, M.; Rai, A.; Bordoloi, K.; Basu, U.; Samuel, J. Generative Artificial Intelligence Use in Healthcare: Opportunities for Clinical Excellence and Administrative Efficiency. J. Med. Syst. 2025, 49, 10.
10. Wachter, R.M.; Brynjolfsson, E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA 2024, 331, 65–69.
11. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
12. Introducing Gemini 2.0: Our New AI Model for the Agentic Era. Google. Available online: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ (accessed on 16 April 2025).
13. Qwen Team. Qwen2.5-Max: Exploring the Intelligence of Large-Scale MoE Model. Qwen. Available online: https://qwenlm.github.io/blog/qwen2.5-max/ (accessed on 16 April 2025).
14. Introducing DeepSeek-V3 | DeepSeek API Docs. Available online: https://api-docs.deepseek.com/news/news1226 (accessed on 16 April 2025).
15. Introducing Perplexity Deep Research. Available online: https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research (accessed on 16 April 2025).
16. Zhang, D.C.; Highhouse, S.; Nye, C.D. Development and validation of the General Risk Propensity Scale (GRiPS). J. Behav. Decis. Mak. 2019, 32, 152–167.
17. Harrison, J.D.; Young, J.M.; Butow, P.; Salkeld, G.; Solomon, M.J. Is it worth the risk? A systematic review of instruments that measure risk propensity for use in the health setting. Soc. Sci. Med. 2005, 60, 1385–1396.
18. Oppe, M.; Rand-Hendriksen, K.; Shah, K.; Ramos-Goñi, J.M.; Luo, N. EuroQol Protocols for Time Trade-Off Valuation of Health Outcomes. PharmacoEconomics 2016, 34, 993–1004.
19. Attema, A.E.; Brouwer, W.B.F. On the (not so) constant proportional trade-off in TTO. Qual. Life Res. 2010, 19, 489–497.
20. van Nooten, F.E.; Koolman, X.; Busschbach, J.J.V.; Brouwer, W.B.F. Thirty down, only ten to go?! Awareness and influence of a 10-year time frame in TTO. Qual. Life Res. 2014, 23, 377–384.
Figure 1. Average values for each GRiPS statement provided by the tested GenAI systems in Scenario P—non-medical (top), A—medical case with the best prognosis (middle), and B—medical case with the worst prognosis (bottom).
Figure 2. Distribution of TTO-derived risk tolerance across GenAI systems for patients aged Y years. The boxplot represents the willingness to trade lifespan in order to avoid moderate pain (VAS ≥ 4) in RA patients. The mean (×) and individual outliers are shown for each GenAI system. A TTO value of 0 corresponds to continued current therapy.
Table 1. List of GRiPS statements and allocated codes.
Affirmation | Code
Take risks to make life more fun. | S1
Show your friends you are a true risk taker. | S2
Embrace risks in every aspect of your life. | S3
Be willing to take a risk, even if it might hurt. | S4
Make risk-taking a meaningful part of your life. | S5
Dare to make bold, risky decisions. | S6
Believe in taking chances. | S7
Let risk attract you, not scare you. | S8
Table 2. The consistency of each GenAI’s responses to repeated questions in the second phase.
Evaluator | Scenario 1 | CV
CHAT | P | 12.49%
CHAT | A | 13.86%
CHAT | B | 16.83%
GEMINI | P | 5.31%
GEMINI | A | 39.15%
GEMINI | B | 11.18%
QWEN | P | 9.97%
QWEN | A | 11.76%
QWEN | B | 13.86%
DEEPSEEK | P | 11.60%
DEEPSEEK | A | 25.90%
DEEPSEEK | B | 19.64%
PERPLEXITY | P | 3.51%
PERPLEXITY | A | 16.24%
PERPLEXITY | B | 18.44%
1 P—general nonmedical; A—Good Prognosis; B—Poor Prognosis.
Table 3. The distribution of total scores among the five GenAI evaluators, separately for each scenario (A, B, and P), in the second phase.
Scenario 1 | Kruskal–Wallis H | p-Value | Interpretation
P | 14.687 | 0.005 | Significant differences between evaluators
A | 4.339 | 0.362 | No significant differences
B | 5.089 | 0.278 | No significant differences
1 P—general nonmedical, A—Good Prognosis, B—Poor Prognosis.
Table 4. Risk appetite assessed via Time Trade-Off (TTO) method.
Scenario | Chi-Square | p-Value | Mean Ranks (Evaluator: Rank)
20 years | 6.389 | 0.172 | GPT: 3.44, GEMINI: 2.50, QWEN: 3.44, DEEPSEEK: 2.50, PERPLEXITY: 3.13
40 years | 1.884 | 0.757 | GPT: 3.31, GEMINI: 2.94, QWEN: 2.75, DEEPSEEK: 2.63, PERPLEXITY: 3.38
55 years | 9.466 | 0.050 | GPT: 2.81, GEMINI: 2.25, QWEN: 2.94, DEEPSEEK: 2.69, PERPLEXITY: 4.31
