1. Introduction
In recent years, artificial intelligence (AI) has demonstrated significant potential in medical research, including accelerating drug discovery and predicting disease outcomes [
1]. A notable advancement in AI technology is the release of ChatGPT, a large language model (LLM) developed by OpenAI (San Franscisco, CA, USA) [
2]. ChatGPT enables human-like conversations and has demonstrated promising performance in tasks such as text classification and answering questions [
3]. A recent scoping review focusing on ChatGPT and its role in pharmacy practice revealed not only its usefulness for medication advice, drug dosage calculation, and drug–drug interactions but also highlighted limitations such as variability in accuracy, lack of reproducibility, and challenges in applying it to personalized treatment [
4]. A recent study comparing three LLMs—ChatGPT, Gemini, and Copilot—on common pharmacokinetic problems showed the superior performance of ChatGPT; it achieved the highest number of correct answers [
5]. This finding suggests that LLMs may serve as an educational strategy for teaching pharmacokinetics to clinical pharmacy students and that ChatGPT holds promise as a supportive tool for therapeutic drug monitoring (TDM) operations. Despite these advances, research integrating LLM with pharmacokinetics remains limited, and no studies have reported the use of LLM in clinical TDM.
Vancomycin (VCM) is widely used to treat serious infections such as bacteremia, endocarditis, pneumonia, and meningitis and remains the first-line therapy for methicillin-resistant
Staphylococcus aureus infections [
6,
7,
8]. The efficacy of VCM is closely associated with the ratio of the area under the serum concentration (AUC)–time curve to the minimum inhibitory concentration; however, its dose-dependent nephrotoxicity remains a major concern [
9,
10]. Therefore, TDM is essential to optimize efficacy while minimizing toxicity. Current guidelines recommend AUC-based dosing, assuming a minimum inhibitory concentration (MIC) of 1 mg/L, whereby the AUC/MIC target is simplified to AUC alone [
11,
12]. Furthermore, recent observational studies and meta-analyses have demonstrated that AUC-based dosing reduces nephrotoxicity compared with trough concentration (C
min)-based dosing targeting 15–20 mg/L [
13,
14,
15,
16]. Within the framework of model-informed precision dosing, Bayesian estimation using population pharmacokinetic (PopPK) models enables accurate AUC prediction and supports individualized therapy [
17]. Presently, in clinical practice, TDM platforms typically require manual intervention by healthcare professionals, involving the interpretation of Bayesian-estimated pharmacokinetic parameters and the iterative design of dosing plans to achieve target AUC values. By contrast, integrating ChatGPT with Bayesian estimation could potentially automate the dosing plan design process by directly translating pharmacokinetic parameters into clinically interpretable dosing recommendations. This approach could complement the existing TDM platforms by maintaining individualized dosing strategies while reducing the workload associated with TDM. Therefore, combining Bayesian estimation based on PopPK models that incorporate individual serum concentration data with ChatGPT could be a useful tool for TDM in VCM.
However, ChatGPT includes a hyperparameter, temperature (T), which controls the balance between the randomness and determinism of its output. Higher temperatures generate more diverse and creative responses, whereas lower temperatures yield more deterministic and conservative outputs [
18,
19]. Accuracy and reproducibility are critical for clinical VCM TDM when designing regimens based on Bayesian estimations. Thus, temperature settings may influence the accuracy and reproducibility of ChatGPT-generated dosing recommendations. Therefore, optimizing this hyperparameter is essential for maximizing the utility of ChatGPT for VCM dosing support.
While the broader question of whether AI can be used for dose adjustment remains an important area of investigation, this study was designed with the purpose of serving as a methodological and exploratory analysis focusing on a fundamental aspect of LLM-based decision support, namely, the impact of hyperparameter settings on reproducibility and consistency. We conducted Monte Carlo simulations of virtual patients using a PopPK model of the VCM and evaluated how ChatGPT hyperparameters influenced the reproducibility and output behavior of dosing calculations derived from Bayesian-estimated pharmacokinetic parameters. This study is a methodological and exploratory investigation to assess the feasibility and characteristics of an LLM-based support under controlled simulation conditions.
2. Materials and Methods
2.1. ChatGPT Method
The default ChatGPT model (gpt-4o-mini) [
20] was accessed using the OpenAI API and programmatically controlled using the statistical software R (version 4.5.1, R Foundation for Statistical Computing, Vienna, Austria).
2.2. Generation of a Virtual Patient Dataset
The virtual patient was defined as a male aged 60 years, weighing 70 kg, with a VCM dose of 1000 mg administered twice daily using a 1 h infusion over a 6-day treatment period. Serum creatinine (SCr) level was fixed at 1.0 mg/dL at the initiation of VCM therapy. Creatinine clearance was calculated using the Cockcroft–Gault equation [
21] based on age, sex, weight, and SCr. The dosage and administration were based on the maintenance dose specified in the guidelines, considering renal function and weight [
12]. To isolate the effect of the ChatGPT hyperparameters on reproducibility and output behavior, patient characteristics and dosing conditions were fixed. This approach was adopted to minimize the variability arising from covariates and to allow for a controlled evaluation of the impact of the ChatGPT temperature settings on model outputs.
2.3. Population Pharmacokinetic Model for Monte Carlo Simulations
The population pharmacokinetic parameters for the central compartment volume of distribution (V
1), peripheral compartment volume of distribution (V
2), distribution clearance (Q), and total clearance (CL
VCM) were obtained from healthy Japanese subjects as reported by Yamamoto et al. [
22] (
Table S1). The rate constant for transfer from the central to the peripheral compartment (k
12) was calculated by dividing Q by V
1:k
12 = Q/V
1. The rate constant for the transfer from the peripheral compartment to the central compartment (k
21) was calculated by dividing Q by V
2, where k
21 = Q/V
2. The elimination rate constant (ke) was calculated by dividing CL
VCM by V
1:ke = CL
VCM/V
1. Serum VCM concentrations for 1000 cases were generated using individual values of population pharmacokinetic parameters to reflect inter-individual variability, and log-normally distributed residual errors were added to simulate intra-individual variability. These simulated concentrations were regarded as true values.
2.4. Setting of Cmin Sampling Points
The Cmin values at 48 h (Cmin48h), 96 h (Cmin96h), and 144 h (Cmin144h), generated from the simulated virtual patient dataset, were extracted from the concentration–time profiles and used as input data for Bayesian estimation.
2.5. Bayesian Estimation Settings
Bayesian estimation was performed using the fixed covariates method, in which SCr was fixed throughout the entire VCM dosing period. The estimation incorporated multiple trough concentrations (C
min48h, C
min96h, and C
min144h) obtained at 48, 96, and 144 h after the initiation of VCM therapy, which were jointly used as input data for the Bayesian estimation. The estimated pharmacokinetic parameters were subsequently used to reconstruct the concentration–time profiles for AUC calculation (see
Section 2.7 for details).
2.6. Population Pharmacokinetic Model for Bayesian Estimation of Individual Pharmacokinetic Parameters
For Bayesian estimation, the C
min of the virtual patient was used for the calculations. The population pharmacokinetic parameters for the steady-state volume of distribution (V
ss), k
12, k
21, and CL
VCM were obtained from Japanese data reported by Yasuhara et al. [
23] (
Table S2). V
1 was calculated using V
ss, k
12, and k
21 as V
1 = k
21 × V
ss/(k
12 + k
21). ke was calculated by dividing CL
VCM by V
1:ke = CL
VCM/V
1.
2.7. Method for Calculating AUC
Blood sampling points corresponding to those used in a previous clinical trial [
24] were applied to each dosing interval, using an eight-point sampling scheme within one dosing interval (τ = 12 h), to calculate AUC. Because VCM was administered twice daily, the AUC
0–24 was calculated as the sum of AUCs over two consecutive dosing intervals using the trapezoidal method. True AUC values were calculated using the trapezoidal method by extracting serum VCM concentrations at eight sampling points from the VCM concentration–time profile generated when the dataset was created. The predicted AUC values were calculated from the predicted VCM concentration–time profile based on the pharmacokinetic parameters estimated by Bayesian estimation, and the trapezoidal method was applied using the VCM concentrations at the same eight sampling points.
2.8. Prompt Syntax in ChatGPT
The data input to ChatGPT was obtained using R version 4.5.1 and RStudio (version 2026.01.1, Posit PBC, Boston, MA, USA). The full prompt template is presented in
Table S3. The prompts were structured in a stepwise format. Serum VCM concentrations and pharmacokinetic parameters were estimated externally by Bayesian estimation prior to being provided to ChatGPT. CL
VCM and V
1, obtained from the Bayesian estimation, were entered into the prompt as input variables. ChatGPT did not perform Bayesian estimation or independently estimate pharmacokinetic parameters. The dose calculation procedure and the AUC calculation formula were explicitly predefined in the prompt, and ChatGPT was instructed to execute these calculations in a rule-based manner. Within the ChatGPT-based procedure, AUC was not calculated using the trapezoidal method but was instead calculated analytically using the predefined formula AUC
0–24 = (Dose × 2)/CL
VCM. The ChatGPT output format was standardized by specifying a unified response structure within the prompt. The administration schedule, SCr values measured at each dosing time and at the final blood sampling point, C
min at the final blood sampling point, predicted C
min at the final blood sampling point, and estimated pharmacokinetic parameters were entered individually into ChatGPT in the API using R version 4.5.1. The administration design using ChatGPT was compiled after setting a specified response format.
2.9. Hyperparameter Settings
Temperature is a hyperparameter that controls the degree of randomness in model outputs, with lower values producing more deterministic responses. In this study, three temperature settings (T = 0.1, 0.5, and 1.0) were used for the reproducibility analysis, whereas T = 0.1 was used to assess the usefulness of ChatGPT for dose design. All other ChatGPT hyperparameters were held constant as follows: maximum tokens = 4096, seed = 1; top p = 1; frequency penalty = 0; and presence penalty = 0.
2.10. Reproducibility and Evaluation of the Usefulness of ChatGPT for Dosage Design
The overall workflow of virtual patient generation via Monte Carlo simulation, Bayesian estimation, ChatGPT-based dosage design, and evaluation of the reproducibility and usefulness of the ChatGPT recommendations is illustrated in
Figure 1.
A total of 1000 independent virtual patients were generated according to the procedures described in
Section 2.2,
Section 2.3,
Section 2.4,
Section 2.5 and
Section 2.6. For the same 1000 patients, the recommended doses and calculated AUCs generated by ChatGPT were obtained five times (runs 1–5) under three temperature conditions (T = 0.1, 0.5, and 1.0), according to the procedures described in
Section 2.8 and
Section 2.9.
Reproducibility at the individual-patient level across runs (runs 1–5) was evaluated using the mode percentage, and the results for T = 0.1, T = 0.5, and T = 1.0 were compared as paired data (identical patients). The mode percentage represents the proportion of identical outputs among the five repeated runs under each temperature condition, reflecting the reproducibility of the ChatGPT-generated results rather than the continuous nature of the calculated variables. The mode percentage was calculated using Equations (1) and (2):
Because the reproducibility was assessed across five repeated runs, the possible values of the mode percentages were discrete and limited to 20%, 40%, 60%, 80%, and 100%.
The mode-calculated AUC from the five runs was adopted for each patient, and the target attainment rate of the AUC was calculated. The dosing regimens generated by ChatGPT were determined using prompts based on the calculated AUC. According to guidelines, an AUC of 400–600 mg·h/L is considered the target range [
11,
12]. Therefore, for each virtual patient, the target attainment rate of the ChatGPT-calculated AUC was determined using Equation (3).
Here, the number of target attainment cases using the calculated AUC is the number of cases with 400 ≤ AUC ≤ 600 mg·h/L.
The pre-optimization AUC (fixed-dose regimen) was defined as the AUC calculated from the simulated serum VCM concentrations 120–144 h after the initiation of VCM therapy under the initial fixed-dose regimen (1000 mg every 12 h). This pre-optimization AUC served as the baseline for comparison with the AUCs obtained using the ChatGPT-recommended regimens. To evaluate the clinical applicability of the dosing regimens generated by ChatGPT, the mode-recommended dosing schedule from the five runs at T = 0.1 was applied to the same virtual patient dataset using the pharmacokinetic parameters employed in the simulation based on the population pharmacokinetic model reported by Yamamoto et al. [
22]. The time course of serum VCM concentrations from day 7 onward was calculated, and the model-simulated AUC was determined. The post-optimization AUC (ChatGPT-guided regimen) was evaluated at 216–240 h after dose adjustment at 144 h. The initial fixed-dose regimen was administered for up to 144 h, after which the dose was adjusted based on the ChatGPT recommendations and maintained thereafter. The AUC values used for the evaluation, including the pre-optimization AUC and post-optimization AUC, were calculated using the trapezoidal method, as described in
Section 2.7. Target attainment rates were calculated using Equations (4) and (5):
Here, the number of target attainment cases using pre-optimization AUC and the number of target attainment cases using post-optimization AUC are the number of cases with 400 ≤ AUC ≤ 600 mg·h/L.
2.11. Evaluation of Prediction Accuracy for Cmin and AUC and Estimation Accuracy of Pharmacokinetic Parameter
The group in which the post-optimization AUC fell within the range of 400–600 mg·h/L was defined as the within-target-range group (IN group), whereas the group in which the post-optimization AUC fell outside this range was defined as the out-of-target-range group (OUT group). The predictability of C
min and AUC and the estimation accuracy of the pharmacokinetic parameters were evaluated using mean absolute prediction error (MAPE) and mean prediction error (MPE). The MAPE and MPE values served as indicators for assessing the prediction accuracy, estimation accuracy, prediction bias, and estimation bias and were calculated using Equations (6) and (7), respectively.
In this study, the true values refer to the ground-truth serum VCM concentrations, AUCs, and pharmacokinetic parameters generated using Monte Carlo simulations.
The prediction accuracy and bias of the Bayesian estimated serum concentrations and the estimation accuracy and bias of the Bayesian estimated pharmacokinetic parameters were compared and evaluated between the two groups.
2.12. Analytical Environment
The generation of individual pharmacokinetic parameters reflecting inter-individual variability, the generation of residuals reflecting intra-individual variability, the simulation of serum VCM concentrations, the Bayesian estimation of individual pharmacokinetic parameters, and AUC calculations were performed using R version 4.5.1. Individual pharmacokinetic parameters and residuals reflecting inter- and intra-individual variability were generated using the rnorm function with a fixed random seed. Serum VCM concentration simulations were performed using the rxode2 package, and serum VCM concentrations were obtained using the RxODE function. Bayesian estimation was performed using the nlmixr2 package, and pharmacokinetic parameters and predicted serum VCM concentrations were obtained using the nlmixr function. AUC calculations were performed using the DescTools package, and AUC values were obtained from serum VCM concentrations using the AUC function.
2.13. Statistical Analysis
Appropriate and inappropriate classifications within paired data were compared using McNemar’s test. The Mann–Whitney U test was used to evaluate the accuracy and variability of the Bayesian estimation between the IN and OUT groups. All statistical data were analyzed using R version 4.5.1. A p value of <0.05 was considered statistically significant.
4. Discussion
In this study, we demonstrate the feasibility of TDM using ChatGPT based on Bayesian estimation results in a VCM simulation study. We found that setting the hyperparameter temperature to 0.1 improved the reproducibility of dosage design and that administration design using ChatGPT based on Bayesian estimation results could convert AUCs that failed to achieve target attainment into AUCs within the target range. These findings suggest that the temperature adjustment of ChatGPT is essential for developing ChatGPT-based TDM support tools that optimize personalized treatments. In addition, the use of Bayesian estimation results in a more effective administrative design.
Reproducibility is often a major concern in AI-related research, and reproducibility issues have been reported in medical testing and model-based evaluation [
25]. In addition, it has been reported that reproducibility in the field of pharmacokinetic research using ChatGPT remains a challenge in the construction of PopPK models [
26]. When the Bayesian estimation results were incorporated into the prompts, the reproducibility of ChatGPT’s recommended dosage and the calculated AUC, assessed using the mode percentage, varied depending on the temperature setting, with higher mode percentage values observed at T = 0.1 than at T = 0.5 and 1.0, as shown in
Figure 2. Previous medical AI research has often used binary agreement metrics to assess response reproducibility, categorizing repeated outputs as identical or non-identical [
27,
28]. In this study, the recommended dose was a discretized continuous variable ranging from 100 to 1500 mg in 100 mg increments. Additionally, the calculated AUC is a continuous variable with an integer value. To the best of our knowledge, the reproducibility of quantitative outputs such as dose and AUC across temperature settings has not been systematically evaluated in previous LLM studies. Therefore, we evaluated the reproducibility across temperature settings using the mode percentage, which captures the frequency of identical outputs and allows for a quantitative comparison of temperature-dependent variations. The differences in the calculation results observed across the temperature settings in this study are likely attributable to the internal calculation errors generated by ChatGPT. Lower temperature settings are known to promote more deterministic responses [
18,
19]. Consistent with this, ChatGPT exhibited more deterministic calculation behavior at T = 0.1 than at T = 0.5 and T = 1.0, resulting in higher reproducibility, as reflected by increased mode percentages for the recommended dosage and calculated AUC. Despite these differences in reproducibility, the target attainment rates of the calculated AUCs were comparable across temperature settings, and the AUC values remained within the target range of 400–600 mg·h/L. Although statistical significance was not formally tested, these differences are minimal and unlikely to be clinically meaningful. In contrast, reproducibility was clearly improved at lower temperature settings, which may be clinically relevant for ensuring consistent outputs in TDM decision support. Taken together, these findings indicate that setting the temperature to 0.1 is necessary to ensure reproducibility in the development of clinical TDM support tools.
To further interpret the modeling framework used in this study, we intentionally fixed patient covariates such as sex, age, body weight, and renal function to reduce the variability arising from known covariates, thereby enabling a clearer evaluation of the effect of ChatGPT temperature settings on the output behavior. Interindividual variability was preserved using variability parameters derived from the original PopPK model, reflecting unexplained variability beyond the included covariates. This approach enables the construction of a controlled simulation environment while maintaining a realistic level of pharmacokinetic variability. This controlled design is consistent with methodological studies aimed at evaluating model behavior under simplified conditions [
29]. In addition, different population pharmacokinetic models were used for the simulation and Bayesian estimation to better reflect real-world clinical situations, where the true underlying pharmacokinetic model is unknown, and model misspecification is unavoidable [
30]. To further enhance generalizability, future studies incorporating interindividual variability in patient covariates will be warranted.
We further simulated the virtual patient by administering the ChatGPT-recommended dose at T = 0.1 and assessed whether the post-optimization AUC values were within the target range compared with the pre-optimization AUC values. As patient characteristics and renal function were fixed in the virtual patient setting, we also simulated a guideline-based regimen of 1000 mg every 12 h. The low target achievement rate before optimization of 25.5% is consistent with previous reports that empirical or non-interventional vancomycin administration strategies achieve an AUC target of only approximately 30%, reflecting substantial inter-individual variability and the limitations of fixed dosing without TDM [
31,
32]. Therefore, we believe that the values obtained in this simulation are consistent with the real-world evidence that the target achievement rate before individual dosing optimization does not reach the optimal value. Although the pre-optimization AUC value had a target attainment rate of only 25.5%, the post-optimization AUC was 71.5%, indicating that ChatGPT-guided TDM based on Bayesian estimation achieved better dose adjustment than the fixed guideline-based dose. In contrast, previous studies using Bayesian estimation software reported a target attainment rate of 62.0% [
32]. Thus, the results of the present study may be considered consistent with real-world clinical performance. In other words, it was suggested that setting T = 0.1 in ChatGPT-based TDM using Bayesian estimation results would improve the target attainment of AUC compared to guideline-based regimens while maintaining the reproducibility of the administration design shown in
Table 1. Furthermore, we investigated the factors causing the post-optimization AUC to fall outside the target range. The calculated AUC based on ChatGPT, which utilizes Bayesian-estimated clearance, achieved a target attainment rate exceeding 90%. However, when the recommended dose was applied to the post-optimization forward pharmacokinetic simulation and evaluated using the simulated AUC over 216–240 h, the target attainment rate decreased to 71.5% (T = 0.1). Importantly, the two AUC metrics represent fundamentally different concepts. The calculated AUC based on ChatGPT reflects a theoretical value derived directly from Bayesian-estimated pharmacokinetic parameters and deterministic calculations, whereas the post-optimization AUC reflects the realized exposure after incorporating variability and model uncertainty through simulation. Importantly, this study intentionally employed different PopPK models for the simulation and Bayesian estimation, resulting in a model mismatch by design. This model mismatch is considered the primary source of the discrepancy observed in this study, leading to a certain degree of bias in the Bayesian estimation. The MAPE and MPE of C
min, AUC, and CL
VCM between the two groups revealed that this discrepancy was due to the accuracy and variability of Bayesian estimation. If the accuracy of the pharmacokinetic parameters estimated using Bayesian estimation is reduced, even if ChatGPT calculates the AUC using these parameters and determines that it is within an appropriate range, the accuracy of the calculated AUC will inevitably be reduced. This discrepancy reflects error propagation from the Bayesian parameter estimation to the forward concentration–time simulation rather than a limitation of the arithmetic execution of ChatGPT. Notably, the majority of patients in the OUT group were beyond the upper target limit, indicating a tendency toward AUC overestimation (
Figure 3). This can be attributed to the overestimation of clearance by Bayesian estimation in the OUT group, which led ChatGPT to recommend higher doses to achieve an AUC within the target range. Consequently, the resulting post-optimization AUC values were likely overestimated. The accuracy of serum concentration predictions and pharmacokinetic parameter estimations using Bayesian estimation is also important when performing ChatGPT-based TDM. Thus, the observed deviation in the post-optimization AUC can be attributed to the accuracy of the Bayesian parameter estimation rather than to the computational performance of ChatGPT itself.
This study has some limitations. First, data were virtually generated under controlled conditions. This study aimed to isolate and evaluate the influence of temperature parameters on ChatGPT behavior and reproducibility, intentionally excluding variations in patient characteristics such as age, sex, dosing regimen, renal function, and volume of distribution. Although this controlled design enhances internal validity, it may limit its applicability to complex patient backgrounds and the dynamic clinical conditions encountered in real-world practice. More importantly, this study was designed as a methodological and exploratory investigation to evaluate the behavior and reproducibility of ChatGPT under controlled simulation conditions. One of the key strengths of this simulation-based approach is that the true pharmacokinetic parameters and AUC are known, allowing a direct comparison between the true values and those estimated by Bayesian methods and ChatGPT-guided dose optimization, which is not feasible in real-world clinical settings. However, as this study was conducted under controlled simulation conditions, the findings should be interpreted with caution in terms of real-world generalizability. Future studies incorporating variability in patient characteristics, such as age, body weight, and renal function, are warranted to better reflect clinical practice. Second, because the reproducibility was assessed based on five repeated runs, the resolution of the mode-based metrics was limited. This choice reflects computational feasibility and is considered an exploratory study. Future research should incorporate more repeated runs to enable a more accurate assessment. Third, because the characteristics and renal functions of the virtual patients were fixed, the data complexity was relatively low, potentially favoring the T = 0.1 setting. Validation using more complex real-world clinical data, such as data from patients undergoing hemodialysis, is necessary. Dose optimization was performed at a fixed time point (day 6) after three concentration measurements were obtained, which may not reflect the variability in clinical decision timing in real-world practice. In addition, no direct comparison with the established Bayesian TDM software platform was performed. Therefore, the relative performance of this approach compared to standard-of-care methods remains unclear. Future studies should include direct comparisons with existing Bayesian TDM tools using real-world clinical data to further evaluate the behavior and feasibility of LLM-based approaches, as well as the effect of temperature.