Large Language Models for Intraoperative Decision Support in Plastic Surgery: A Comparison between ChatGPT-4 and Gemini

Background and Objectives: Large language models (LLMs) are emerging as valuable tools in plastic surgery, potentially reducing surgeons’ cognitive loads and improving patients’ outcomes. This study aimed to assess and compare the current state of the two most common and readily available LLMs, OpenAI’s ChatGPT-4 and Google’s Gemini Pro (1.0 Pro), in providing intraoperative decision support in plastic and reconstructive surgery procedures. Materials and Methods: We presented each LLM with 32 independent intraoperative scenarios spanning 5 procedures. We utilized a 5-point and a 3-point Likert scale for medical accuracy and relevance, respectively. We determined the readability of the responses using the Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) score. Additionally, we measured the models’ response time. We compared the performance using the Mann–Whitney U test and Student’s t-test. Results: ChatGPT-4 significantly outperformed Gemini in providing accurate (3.59 ± 0.84 vs. 3.13 ± 0.83, p-value = 0.022) and relevant (2.28 ± 0.77 vs. 1.88 ± 0.83, p-value = 0.032) responses. Conversely, Gemini provided more concise and readable responses, with an average FKGL (12.80 ± 1.56) significantly lower than ChatGPT-4’s (15.00 ± 1.89) (p < 0.0001). However, there was no difference in the FRE scores (p = 0.174). Moreover, Gemini’s average response time was significantly faster (8.15 ± 1.42 s) than ChatGPT-4’s (13.70 ± 2.87 s) (p < 0.0001). Conclusions: Although ChatGPT-4 provided more accurate and relevant responses, both models demonstrated potential as intraoperative tools. Nevertheless, their performance inconsistency across the different procedures underscores the need for further training and optimization to ensure their reliability as intraoperative decision-support tools.


Introduction
The introduction of artificial intelligence (AI) into medicine has revolutionized medical practice and patient management by offering precise and individualized healthcare delivery. The integration of deep-learning (DL) techniques into natural language processing (NLP) and the availability of vast amounts of public datasets have led to the development of large language models (LLMs) [1]. Using transformer architectures, LLMs can recognize, summarize, translate, predict, and generate text-based content from the knowledge gained from these extensive datasets [2]. With the increasing amount of medical data and the complexity of clinical decision-making, LLMs can be pivotal for improving the overall quality and efficiency of healthcare as they can assist physicians in making timely, informed decisions [3].
As in any other surgical specialty, plastic surgeons must make time-sensitive decisions that have a significant impact on a patient's outcome and safety. They must maintain up-to-date and robust medical knowledge as well as solid cognitive and mechanical skills [4]. However, in one study of surgical errors, cognitive errors contributed to over half of the adverse events recorded, especially for less-experienced and sleep-deprived surgeons [4][5][6].
AI models can quickly process large quantities of data and demonstrate superior prediction and classification for decision-making [4,7], an advantageous ability intraoperatively. In their scoping review, Navarrete and Hashimoto [5] identified that the three most common uses of AI for intraoperative decision support were: (1) increasing the information available to surgeons, including retrieving similar cases; (2) accelerating intraoperative pathology, including tumor margin mapping, tumor classification, and tissue identification; and (3) recommending surgical steps. In theory, the first and the third of these can readily be performed with current LLMs.
LLMs can process audiovisual and multimodal data and learn their semantic relationships, enhancing machines' capabilities to understand and generate human-like language [1,8,9]. In a study assessing ChatGPT's medical accuracy and comprehensiveness, the model's answers were rated at least nearly all correct for more than 57% of the questions and at least adequate for 79% [3]. In surgery, Oh et al. [10] showed that ChatGPT-4 attained a score of 76.4% in the Korean General Surgery Board Exam, with pediatric and breast knowledge scoring >80%. LLMs can help surgeons make more accurate decisions during surgery by providing alternative solutions to non-typical scenarios based on similar case studies or reference materials. For less-experienced surgeons, LLMs can be helpful with anatomical identification or variations, as well as next-surgical-step guidance [11][12][13]. By training LLMs on more extensive clinical data and medical literature, we might be able to develop AI systems that can support surgeons with intraoperative queries and difficulties, thereby reducing the cognitive load in the OR and contributing to improved patient safety [9,14].
In this study, we aim to evaluate and compare the current state of the two most common and readily available LLMs, OpenAI's ChatGPT-4 and Google's Gemini, in providing intraoperative decision support in plastic and reconstructive surgery procedures without utilizing a retrieval-augmented generation (RAG) approach. Atkinson et al. [14] previously evaluated ChatGPT-4 for intraoperative support for complications in the deep inferior epigastric perforator (DIEP) flap. Building on their research, we evaluated the potential of ChatGPT-4 and Gemini as adjunctive tools intraoperatively by analyzing their generalizability in common procedures in the major fields of plastic surgery: cosmetic, pediatric, craniofacial, microsurgery, general plastic surgery, and hand surgery.

Study Design
To evaluate the generalizability of the LLMs in plastic surgery, we created 32 scenarios, each ending in a question addressing surgical planning, general anatomy, surgical procedure knowledge, and the ability to provide solutions and alternatives for possible complications in 5 distinct procedures: breast augmentation (n = 6), complete cleft lip repair (n = 6), lymphaticovenous bypass (n = 8), mandibular reconstruction with fibula osteoseptocutaneous flap and osteomyocutaneous peroneal-artery-based combined flap harvest (n = 6), and carpal tunnel release (n = 6). Since the scenarios were not designed as if they were about a single patient per procedure, each was prompted individually in separate chats. All the questions were asked just once. After every scenario was presented to one model, the other was tested. Every scenario started with the statement "I am a board-certified plastic surgeon" and was narrated with medical terms to ensure that the models adequately narrowed their responses to the scenario. Additionally, the questions were brief in an attempt to simulate the time-sensitive setting of the OR. In Figure 1, we display an example of the scenarios presented to the LLMs, and in the Supplemental Files, we present the complete list.

Evaluation Tools
To evaluate the medical accuracy of the answers retrieved, we employed a 5-point Likert scale with the following values: 1 point: completely incorrect, the answer is entirely wrong and contradicts established medical knowledge; 2 points: partially incorrect, the answer has some validity but contains significant errors or misleading information; 3 points: partially correct and incorrect, there is a mix of correct and incorrect information; 4 points: partially correct, the answer contains some correct information but might be missing details or have minor inaccuracies; and 5 points: completely correct, the answer matches the information in reference textbooks and known practice. In addition, since the information could be medically accurate but still not pertinent or valuable for the intraoperative setting, we evaluated the relevance of the responses. We utilized a 3-point Likert scale where 1 point was irrelevant, meaning the answer did not provide useful information for the surgeon or the team; 2 points was somewhat relevant, meaning that the answer offered some general information but lacked specific guidance for the surgical situation; and 3 points was relevant, implying that the answer directly addressed the surgical scenario and provided helpful, actionable steps for the surgical team. For accuracy and relevance, we used as the ground truth surgical procedures from textbooks such as Plastic Surgery: 6-Volume Set, 5th Edition, Grabb and Smith's Plastic Surgery, and Green's Hand Surgery [29][30][31][32][33]. Three independent authors analyzed and graded the responses; the most common grade was utilized.
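The consensus rule described above (three graders, most common grade retained) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `consensus_grade` is hypothetical, and returning `None` when all three graders disagree is an assumption, since the text does not state how such cases were resolved.

```python
from collections import Counter


def consensus_grade(grades):
    """Return the most common Likert grade among the graders.

    Assumption: if no single grade is most common (for example, three
    different grades), return None so the case can be resolved manually.
    """
    counts = Counter(grades).most_common()
    top_grade, top_count = counts[0]
    # A tie for first place means there is no single most common grade.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_grade


# Hypothetical grades from three graders for one response.
print(consensus_grade([4, 4, 3]))
print(consensus_grade([3, 4, 5]))
```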
Because of the nature of the intraoperative setting, the responses provided should ideally be short and easy to read. We used the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease (FRE) score to assess the readability and verbosity. The FKGL calculates a text's approximate reading grade level, where a score of 8 indicates that the reader needs a grade 8 reading level or above to understand. The FRE typically yields a score between 0 and 100, with higher scores meaning that the document is easier to read. Both tests take into consideration the number of sentences, words, and syllables to produce a score, thus measuring verbosity [34].
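For reference, both scores can be computed from three counts using the standard published Flesch formulas. The sketch below assumes the counts are already available; the function names and the example counts are illustrative, and the text does not state which tool was used to obtain the scores in this study.

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fre(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (standard formula)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical response: 100 words, 5 sentences, 150 syllables.
print(round(fkgl(100, 5, 150), 2))  # 9.91
print(round(fre(100, 5, 150), 2))
```

Longer sentences and more syllables per word raise the FKGL and lower the FRE, which is why the two scores also act as a rough proxy for verbosity.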
Lastly, we measured the response time for each answer. While LLMs can provide almost instantaneous responses, we wanted to analyze the actual response time and compare it between the two models. Emphasizing the time-sensitive nature of the OR, we wanted to evaluate the consistency of the models in providing timely responses. We timed each response from when the prompt was sent to when the LLM finished providing the complete answer.
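The timing definition above (prompt sent to complete answer received) could be reproduced programmatically along the following lines. This is only a sketch under assumptions: the text does not say how the responses were timed, `send_prompt` stands for any hypothetical callable that blocks until the full answer is available, and `fake_model` is a stand-in used purely for demonstration.

```python
import time


def timed_call(send_prompt, prompt):
    """Return (answer, seconds elapsed) for one prompt.

    `send_prompt` is a hypothetical callable that returns only once the
    complete answer has been produced.
    """
    start = time.perf_counter()
    answer = send_prompt(prompt)
    elapsed = time.perf_counter() - start
    return answer, elapsed


def fake_model(prompt):
    """Stand-in for a real model call; sleeps briefly, then answers."""
    time.sleep(0.01)
    return "placeholder answer"


answer, seconds = timed_call(fake_model, "I am a board-certified plastic surgeon. ...")
print(f"{seconds:.2f} s")
```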

Statistical Analysis
We calculated and charted the mean, mode, standard deviation (SD), and range for all the evaluated metrics of the models' responses using a Microsoft Excel spreadsheet (Version 2403, Build 16.0.17425.20236, 64-bit). We used the Mann-Whitney U test to compare the models' accuracy and relevance. For the readability and response time, we used a two-sample, unpaired, two-tailed Student's t-test. The Mann-Whitney U was calculated manually, while the Student's t-test was calculated using Microsoft Excel's statistical package. We considered a p-value < 0.05 to be statistically significant.
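A manual Mann-Whitney U calculation of the kind mentioned above can be sketched as follows, using the rank-sum definition with tied values sharing their average rank. The function names and the two score lists are hypothetical and are not study data; converting U to a p-value (from tables or a normal approximation) is deliberately left out.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U_a, U_b); the test statistic is usually min(U_a, U_b)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical Likert scores for two models (illustrative only).
model_a = [5, 4, 4, 3]
model_b = [3, 3, 2, 4]
print(mann_whitney_u(model_a, model_b))  # (13.0, 3.0)
```

Because Likert grades are ordinal and heavily tied, the average-rank handling of ties is the part of a manual calculation most worth double-checking.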

Medical Accuracy
Overall, ChatGPT-4's responses were significantly more accurate than Gemini's (p = 0.022). ChatGPT-4's mean score was 3.59 ± 0.84, with responses ranging from 2 to 5 points, and 56% of them were at least partially correct. Conversely, Gemini averaged 3.13 ± 0.83, with only 28% of its responses at least partially correct (≥4 points), 59% partially correct and incorrect (3 points), and scores as low as 1 point (completely incorrect) (Figure 2). ChatGPT-4 outperformed Gemini in all the procedures except for cheiloplasty, where 66% of Gemini's and 50% of ChatGPT-4's responses scored three points; both got only one response partially correct and not even one entirely correct. However, the difference was not statistically significant (p = 0.344).
ChatGPT-4's mean score of 4.00 ± 0.63 for breast augmentation was significantly better than Gemini's 3.00 ± 1.00, with a p-value of 0.046. In total, 83% of ChatGPT-4's responses were at least partially correct, while 50% of Gemini's were partially correct and incorrect, 33% were partially correct, and none were completely correct. For the lymphovenous bypass procedure, not only was there no significant difference (p = 0.200) but the performance was the most similar, with ChatGPT-4 averaging 3.75 ± 0.46 and Gemini 3.50 ± 0.53. Nevertheless, 75% of the former's responses were partially correct. In comparison, only 50% of Gemini's had the same result. Neither model produced partially incorrect or completely incorrect answers, nor did either produce a completely correct response.
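The lymphovenous bypass figures can be cross-checked with a short calculation. The score lists below are not the raw study data; they are reconstructed from the reported percentages under the assumption that every response not graded 4 was graded 3 (no response in this procedure scored below 3 or reached 5). The helper name `summarize` is illustrative.

```python
from statistics import mean, stdev


def summarize(scores, threshold=4):
    """Return (mean, sample SD, share of scores >= threshold)."""
    share = sum(s >= threshold for s in scores) / len(scores)
    return round(mean(scores), 2), round(stdev(scores), 2), share


# Reconstructed from the reported percentages for 8 scenarios (assumption).
chatgpt4_lvb = [4] * 6 + [3] * 2   # 75% partially correct
gemini_lvb = [4] * 4 + [3] * 4     # 50% partially correct

print(summarize(chatgpt4_lvb))  # (3.75, 0.46, 0.75)
print(summarize(gemini_lvb))    # (3.5, 0.53, 0.5)
```

The agreement with the reported 3.75 ± 0.46 and 3.50 ± 0.53 suggests that the SDs in the text are sample standard deviations (n - 1 in the denominator).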
There was no significant difference in the mandibular reconstruction with the fibular osteoseptocutaneous flap procedure (p = 0.149). However, 66% of ChatGPT-4's responses were at least partially correct, with an average score of 4.00 ± 1.26, while 50% of Gemini's were partially correct and incorrect, with a mean score of 3.17 ± 1.33. Notably, this was the only procedure where Gemini provided a completely correct answer. For carpal tunnel release, the average accuracy score for ChatGPT-4 was 3.33 ± 0.52, and for Gemini, 2.83 ± 0.41. A total of 83% of Gemini's responses were partially correct and incorrect; meanwhile, 100% of ChatGPT-4's were either partially correct and incorrect or partially correct. However, there was no statistically significant difference, with a p-value of 0.100. An overview of the models' performance per procedure is shown in Figure 3.

Relevance
ChatGPT-4 significantly outperformed Gemini in terms of the answers' relevance, with a p-value of 0.032. ChatGPT-4's responses averaged 2.28 ± 0.77, ranging from irrelevant to relevant (1-3 points), with 47% being relevant. On the other hand, Gemini's answers averaged 1.88 ± 0.83, and although they similarly ranged from 1 to 3 points, 40% were irrelevant (Figure 4). Similar to the accuracy, cheiloplasty was the only procedure where Gemini (1.67 ± 0.82) outperformed ChatGPT-4 (1.33 ± 1.00), providing 50% irrelevant responses and one relevant response. Conversely, ChatGPT-4 provided 66% irrelevant answers, and the rest were somewhat relevant. Nevertheless, there was no significant difference, with a p-value of 0.260.
While ChatGPT-4 was superior for the rest of the procedures, there was a significant difference for only carpal tunnel release (p = 0.015). The model achieved a mean score of 2.33 ± 0.82, and 83% of the responses were at least somewhat relevant. Gemini averaged 1.17 ± 0.41, and 83% of the responses were irrelevant. In terms of breast augmentation, ChatGPT-4's mean score was 2.50 ± 0.55, 50% of its responses were relevant, and the other 50% were somewhat relevant. In contrast, Gemini averaged 1.83 ± 0.75 and provided somewhat relevant or irrelevant responses in 83% of the scenarios. However, there was no statistically significant difference (p = 0.075).
Once more, the models' performance was the most similar in the lymphovenous bypass procedure, with mean scores of 2.75 ± 0.46 and 2.50 ± 0.76 for ChatGPT-4 and Gemini, respectively. There was no significant difference between the two (p = 0.299). Nonetheless, ChatGPT-4 performed slightly better, providing relevant responses in six out of eight scenarios and no irrelevant responses. Gemini, on the other hand, provided five relevant responses and one irrelevant response. In the mandibular reconstruction with fibular osteoseptocutaneous flap procedure, ChatGPT-4 had an average score of 2.33 ± 0.82 and retrieved relevant responses 50% of the time. Conversely, Gemini averaged 2.00 ± 0.89 and provided relevant responses 33% of the time. There was no significant difference for this procedure (p = 0.260) (Figure 5).

Readability
Gemini's responses were more readable and concise, most of the time also being shorter. This was shown by Gemini's significantly lower FKGL mean of 12.80 ± 1.56 compared to ChatGPT-4's mean of 15.00 ± 1.89, with a p-value < 0.0001. Gemini's superi

Time of Response
Gemini significantly outperformed ChatGPT-4 in providing timely responses, with a p-value < 0.0001. The average response time for Gemini was 8.15 ± 1.42 s; meanwhile, ChatGPT-4's average was 13.70 ± 2.87 s. For all the procedures, Gemini retrieved significantly faster responses than ChatGPT-4, with p-values < 0.001 for breast augmentation, lymphovenous bypass, and mandibular reconstruction and p-values of 0.003 for cheiloplasty and 0.016 for carpal tunnel release. Gemini's fastest performance was 4.78 s, provided in a mandibular reconstruction scenario, and its slowest was 11.13 s for a cleft lip repair question. Conversely, ChatGPT-4's fastest response was provided for a carpal tunnel release scenario in 8.42 s, and its slowest response, on a mandibular reconstruction case, was 20.41 s. Figure 6 shows a comparison between the two LLMs. We present the complete list of the LLMs' responses in the Supplementary File.
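As a rough consistency check, the overall response-time comparison can be recomputed from the reported summary statistics alone. The sketch below applies the pooled-variance (Student's) two-sample t statistic, assuming n = 32 responses per model (one per scenario); the function name `pooled_t` is illustrative, and only the t statistic and degrees of freedom are computed, not the p-value.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's two-sample t statistic and degrees of freedom
    computed from summary statistics (equal-variance form)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df


# Reported response times in seconds; n = 32 per model is an assumption.
t, df = pooled_t(13.70, 2.87, 32, 8.15, 1.42, 32)
print(round(t, 2), df)
```

Under these assumptions the statistic is close to 9.8 on 62 degrees of freedom, which is in line with the reported p < 0.0001.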

Discussion
In plastic surgery, LLMs have gained recognition owing to their constant advancement and adequate medical performance [2,14,16,22,24,28,35]. Nevertheless, due to the inherent conditions of the intraoperative environment, in most scenarios, the margin of error is near zero. Moreover, these models' use was previously limited as they were bound to text-only input and output, consuming valuable time during the surgical procedure. The new updates to ChatGPT and Gemini allow the models to receive and provide information with audio, which might improve the effectiveness of their use during the intraoperative period. Although the current experimentation with LLMs as intraoperative tools is limited, their application is promising.
Due to their ability to process vast amounts of data in several formats, LLMs can help with intraoperative monitoring and alert surgeons to intervene in a timely fashion [9,[11][12][13]. They can also suggest individualized procedural modifications based on the latest research and clinical guidelines [14,20] while automatically generating surgical records with key information from the procedure [12,13]. LLMs may be groundbreaking for improving surgical outcomes by complementing human expertise and enhancing coordination with other AI instruments [9,12]. Ultimately, their responses can enhance the efficiency and precision of surgery as they connect theoretical knowledge and real-time surgical application [14].
This is the first study evaluating the current state of ChatGPT-4 and Gemini as intraoperative decision-support tools in plastic surgery without using any RAG technique. Furthermore, we compared the two models to identify their strengths and limitations and determine which was superior. For providing medically accurate information, ChatGPT-4 outperformed Gemini, whose answers were determined to be both partially correct and incorrect almost 60% of the time. While ChatGPT-4's responses were more accurate, most of them were still determined to be only partially correct. Nevertheless, in most scenarios, both models demonstrated an understanding of the situation about which they were questioned and retrieved concise and logical responses.
Similar to the results obtained by Atkinson et al. [14], the language models demonstrated an adequate understanding of anatomy most of the time, being able to help identify anatomical structures and landmarks to guide the surgeon during the procedure. This was especially noticeable during the lymphaticovenous bypass, mandibular reconstruction, and breast augmentation procedures. During the first of these, the models accurately recommended where to place the incisions when indocyanine green (ICG) was not enough to find healthy lymphatics, based on an understanding of the lymphatic and venous organization and their anatomical relationship. Moreover, the models were able to adequately suggest the appropriate location of the nourishing vessels for the fibula osteoseptocutaneous flap and provide useful recommendations for correctly identifying the pectoralis major muscle for submuscular breast implant placement.
However, this performance was inconsistent, especially for ChatGPT-4. The model particularly struggled with the complete cleft lip repair, starting from the inability to accurately identify the anatomical locations for marking placement. Although ChatGPT-4 adequately recommended using the facial artery as a recipient for the fibula osteoseptocutaneous flap, it also recommended using the lingual and maxillary arteries. It also misunderstood the setting and provided irrelevant recommendations, where instead of offering actionable guidance, it recommended some literature. Gemini was somewhat better, notably for the cheiloplasty. It also provided more relevant responses. An example was seen in the mandibular reconstruction, where despite only recommending the facial artery as a recipient vessel, it provided a logical explanation for its reasoning, explaining why to discard other vessels, and recommended supplemental articles.
When it came to procedural steps, general procedure knowledge, and complication solving, ChatGPT-4 outperformed Gemini. Nonetheless, both models showed a strong grasp of the procedures and offered concise and logical guidance throughout. ChatGPT-4 particularly excelled in the breast augmentation and jaw reconstruction scenarios, while Gemini excelled in the complete cleft lip repair. They achieved the most similar performance during the lymphaticovenous bypass scenarios, where the models mostly provided both accurate and relevant responses. Conversely, their worst performance was on the carpal tunnel release procedure, where most of ChatGPT-4's responses were superficial or incomplete, and those of Gemini were also irrelevant as they did not offer immediate solutions. This was consistent with another study evaluating ChatGPT for providing carpal tunnel syndrome diagnosis and management, where its responses were superficial, with no deeper explanation or reasoning for specific treatments, and referenced nonexistent publications [24].
Medicina 2024, 60, 957
Additionally, ChatGPT-4 kept struggling in terms of the cheiloplasty procedure, erroneously recommending pursuing a Mohler's incision for a philtral column discrepancy of 1.7 mm without providing any reasoning.On the other hand, when asked about the recommended ischemia time limit for the fibular osteocutaneous flap, Gemini incorrectly advised that it was safe to exceed 5 h.Moreover, Gemini provided irrelevant responses to 40% of the questions, many of which were related to the model, stating that it was unable to provide medical advice and limiting its answers to recommending related literature or websites.
The models provided superficial and incomplete responses even when they were accurate, and while they were enough for some scenarios presented, for others they were not.In Mohapatra et al. [2], ChatGPT was evaluated as a teaching assistant for plastic surgery residents, and the authors concluded that the model was likely to cause confusion among residents as, despite providing fairly accurate procedural steps, it also provided inaccurate statements and missed critical steps.Additionally, Atkinson et al. [14] identified that although ChatGPT's responses were consistently accurate, they were somewhat superficial and corresponded to the knowledge level of a trainee, not offering any insight beyond what an expert plastic surgeon would already know.
Given the time-sensitive nature of the intraoperative scenarios, LLMs must provide concise and fast responses so that the surgical team can act timely.The FKGL and the FRE scores consider the number of sentences and words to determine a text's reading level.In a previous study, ChatGPT's FKGL and FRE score indicated a hard reading level appropriate for only 33% of adults and those with a college education [36].Furthermore, the texts produced by ChatGPT were harder than those from Bard, Gemini's predecessor [37].Our results show similar characteristics, as Gemini's FKGL was significantly lower than ChatGPT-4's, and although there was no significant difference in the FRE score, Gemini's was higher than ChatGPT-4's, indicating that Gemini's responses were easier to read.Even though a surgeon surpasses the level of education required for comprehending either model's text, more readability also indicates more concise, straightforward responses.This came in conjunction with timelier responses, as Gemini proved to respond significantly faster than ChatGPT-4.However, Gemini's evasiveness when responding may contribute to the difference in average readability scores and time of response.
The use of LLMs as intraoperative decision-support tools has significant ethical implications. Despite the models' accurate and relevant performance, the lack of depth and consistency in their responses can lead to patient harm. This may raise issues of accountability and responsibility, as liability remains uncertain in the case of a negative outcome due to erroneous LLM guidance. Notably, Gemini states its inability to provide medical advice before continuing with its response, whereas ChatGPT-4 provides its answer directly. Moreover, there is the issue of data privacy and security: current LLMs send all the information entered in their chats to their servers, but to be integrated into real clinical settings, they would need to handle sensitive patient information [38]. Additionally, as the models' responses are based on their training data, they are subject to biases, posing the risk of outdated information, unequal care quality, and discrimination [39,40].

Strengths and Limitations
This is the first study comparing the current state of two of the most common and readily available LLMs as intraoperative decision-support tools in plastic and reconstructive surgery. By providing scenarios evaluating the models' knowledge of anatomy, procedural steps, and problem-solving concerning five different procedures, we analyzed the models' generalizability in the specialty. However, our study has some limitations. First, the limited number of questions per procedure restricted the depth with which we could explore the models' understanding of the procedures. Additionally, because each scenario evaluated one procedure in an independent patient, we had limited capacity to replicate and test the models' ability to adapt to different complications in the dynamic and multifaceted nature of real-time surgical decision-making, where multiple factors and changing conditions must be considered simultaneously. Finally, the relatively small sample size may affect the power to detect significant differences between the LLMs, especially in the subgroup analyses of individual procedures. While our findings provide a strong foundation for understanding LLMs' performance in plastic surgery, their generalizability to other surgical specialties needs further investigation. The current limitations are likely influenced by the models' training data, which may not be equally comprehensive across other surgical domains. On the other hand, some limitations may be generalizable, such as the models' struggle to understand the context due to flaws in the prompting. Using advanced prompting techniques might have reduced the superficiality and evasiveness of the models' responses and helped improve their accuracy and relevance. By tailoring the prompts to the specific terminology and information needs of different surgical specialties, future work can investigate whether these techniques improve the overall accuracy, relevance, and applicability of LLMs for surgeons across a broader range of procedures.
It is important to consider that LLMs evolve rapidly and constantly, potentially limiting the validity of our results in the near future. Nevertheless, the continuous evaluation of these models' performance provides crucial insights to guide their future development and implementation in practice. Future research directions point toward the development of specialty-specific models leveraging adaptation techniques, such as fine-tuning or retrieval-augmented generation, that constrain the models to vetted, accurate information and improve their contextual understanding.

Conclusions
Our study provided valuable insights into the current state of two readily available LLMs. Although ChatGPT-4 generally provided more medically accurate and relevant responses than Gemini, both models demonstrated adequate knowledge for supporting surgeons during operative procedures. However, the performance of both LLMs varied across the different surgical scenarios, with neither model consistently delivering completely correct or relevant information. This variability highlights the need for further development and optimization to ensure their reliability and precision in the intraoperative setting. This study underscored the critical balance between accuracy, relevance, speed, and conciseness that LLMs must achieve to be effectively integrated into this part of surgical practice. While the models have no immediate application, they may still provide valuable guidance, especially for inexperienced surgeons and residents. Additional experimentation leveraging retrieval-augmented generation techniques might help overcome the models' current limitations and accelerate their application in the operating room.

Figure 1. Example of the questions provided to the LLMs.


Figure 6. LLMs' average response time in seconds. Error bars represent standard deviations.
