Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions
Simple Summary
Abstract
1. Introduction
1.1. Applications of LLMs in Cancer Care
1.2. State of the Art in Toxicity Monitoring
1.3. Overview and Paper Structure
- Methods: describes the study design, including the creation of fictitious cases, the evaluation process by expert oncologists, and the contextualization of the LLM model for classifying subjective toxicities.
- Results: presents the comparative analysis between the evaluations performed by oncologists and the LLM model, detailing accuracy metrics and types of errors.
- Discussion: interprets the findings, highlighting the potential and limitations of LLMs in clinical practice and suggesting directions for future research.
- Conclusions: summarizes the key findings and their implications for clinical oncology, emphasizing the need for specific training of LLMs and further studies with real patients.
2. Methods
2.1. Study Design
2.1.1. Study Type
2.1.2. Study Objectives
- Evaluate the ability of an LLM to classify subjective toxicities in patients undergoing chemotherapy.
- Assess the feasibility of using fictitious cases as a basis to justify future studies with real patients.
- Evaluate the accuracy of a contextualized LLM model that has not been specifically trained for the task of classifying subjective toxicities.
2.1.3. General Approach Description
2.2. Ethical Considerations
2.3. Participants
2.3.1. Participant Description
2.3.2. Number of Participating Oncologists
2.3.3. Inclusion and Exclusion Criteria
- Current clinical practice in medical oncology.
- Familiarity with the use of the CTCAE version 5.0 [28].
- Availability to participate in the entire study.
- Could not complete the evaluation of all assigned cases.
2.4. Fictitious Cases
2.4.1. Creation of Fictitious Cases
2.4.2. Process of Generating 30 Expert-Knowledge-Based Fictitious Cases
2.5. Toxicity Chart
Evaluation of Toxicity According to CTCAE v.5
- Grade 0 (None): Absence of the evaluated toxicity.
- Grade 1 (Mild): Mild or asymptomatic symptoms that do not require medical intervention. Patients can continue with their daily activities without significant interruptions.
- Grade 2 (Moderate): Moderate symptoms that may require minimal, local, or non-invasive intervention. Patients experience limitations in Instrumental Activities of Daily Living (IADLs) but can manage them with some difficulty.
- Grade 3 (Severe): Severe, medically significant, and potentially disabling symptoms. They may require hospitalization or the prolongation of hospital stay, limiting the patient’s ability to perform self-care Activities of Daily Living (ADLs).
- Grade 4 (Life-threatening): Adverse events with potentially life-threatening consequences that require urgent medical intervention. They represent an immediate threat to the patient’s life.
- Grade 5 (Death): Adverse events resulting in the patient’s death.
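The grade-to-label mapping above can be sketched as a small lookup. This is an illustrative Python fragment only; the names are hypothetical and not taken from the study:

```python
# Illustrative sketch: a lookup of the CTCAE v5.0 severity labels listed above.
CTCAE_GRADES = {
    0: "None",
    1: "Mild",
    2: "Moderate",
    3: "Severe",
    4: "Life-threatening",
    5: "Death",
}

def describe_grade(grade: int) -> str:
    """Return the CTCAE v5.0 severity label for a numeric grade (0-5)."""
    if grade not in CTCAE_GRADES:
        raise ValueError(f"CTCAE grades range from 0 to 5, got {grade}")
    return CTCAE_GRADES[grade]

print(describe_grade(3))  # Severe
```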
2.6. Expert Evaluation
2.7. Contextualization of the LLM Model
Process of Contextualizing the Model for Classifying Subjective Toxicities
2.8. Analysis of Results
- Accuracy in General Categories: This was defined as the agreement between the LLM model's severity classifications and the oncologists' classifications after grouping the toxicities into general categories: "no toxicity" (grade 0), "mild toxicities" (grades 1–2), and "severe toxicities" (grades 3–4). Accuracy was calculated by checking whether the model's classification fell within the same general category as the mode of the oncologists' classifications.
- Accuracy in Specific Categories: The model's accuracy was evaluated in terms of exact agreement with the oncologists' specific grade for each toxicity, without grouping.
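As a minimal sketch of the two accuracy measures (function and variable names are illustrative, not the authors' code; it assumes one LLM grade and one list of expert grades per toxicity item):

```python
from statistics import mode

def general_category(grade: int) -> str:
    # Grouping used in the study: 0 = no toxicity, 1-2 = mild, 3-4 = severe.
    if grade == 0:
        return "none"
    return "mild" if grade <= 2 else "severe"

def accuracies(llm_grades, oncologist_grades):
    """Compare LLM grades with the mode of the oncologists' grades.

    llm_grades: one grade per toxicity item.
    oncologist_grades: one list of expert grades per toxicity item.
    Returns (general_accuracy, specific_accuracy).
    """
    consensus = [mode(grades) for grades in oncologist_grades]
    n = len(llm_grades)
    specific = sum(l == c for l, c in zip(llm_grades, consensus)) / n
    general = sum(
        general_category(l) == general_category(c)
        for l, c in zip(llm_grades, consensus)
    ) / n
    return general, specific

gen, spec = accuracies([2, 1, 0], [[1, 2, 1], [2, 3, 3], [0, 0, 1]])
print(round(gen, 3), round(spec, 3))  # 0.667 0.333
```

Note that `statistics.mode` breaks ties by returning the first value encountered (Python 3.8+); the study does not specify its tie-breaking rule.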
2.8.1. Types of Errors (Mild and Severe)
- Mild Errors: Mild errors are those in which the model classified a toxicity with a higher grade than the oncologists indicated; for example, a patient presents with a mild taste alteration (grade 1) and the model classifies it as moderate (grade 2). In a real clinical setting, mild errors could lead to more conservative management, such as additional tests. While this may increase the workload for healthcare staff and the associated costs, it does not directly compromise patient safety.
- Severe Errors: Severe errors are those in which the model classified a toxicity with a lower grade than the oncologists indicated; for example, a patient presents with severe diarrhea (grade 3) and the model classifies it as moderate (grade 2). These errors can have a more significant negative impact on patient management: underestimating the severity of a toxicity could delay appropriate treatment, increasing the risk of serious complications. For example, failing to identify severe diarrhea correctly could lead to severe dehydration and avoidable hospitalizations.
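This error convention (overgrading counts as a mild error, undergrading as a severe error) can be sketched as follows; the helper name is illustrative and not taken from the study:

```python
from typing import Optional

def classify_error(llm_grade: int, expert_grade: int) -> Optional[str]:
    """Label a disagreement using the study's convention:
    overgrading is a 'mild' error, undergrading a 'severe' error."""
    if llm_grade == expert_grade:
        return None  # exact agreement: no error
    return "mild" if llm_grade > expert_grade else "severe"

print(classify_error(2, 1))  # mild: model overestimated a grade-1 toxicity
print(classify_error(2, 3))  # severe: model underestimated a grade-3 toxicity
```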
2.8.2. False Alarm Analysis
3. Results
3.1. Dispersion in Expert Evaluation
3.2. Mode Evaluation
4. Discussion
4.1. Summary of Main Findings
4.2. Interpretation of the Results
4.3. Errors
4.4. Limitations
4.5. Clinical and Practical Relevance
4.6. Future Implications
- Studies with Real Patients: Conduct studies with real patients to validate the accuracy and utility of the LLM model in a clinical setting. These studies should consider the AI system’s performance across different demographic groups, disease severity levels, and comorbidities.
- Specific Model Training: Develop and train LLM models specifically for the classification of oncological toxicities, using relevant clinical datasets. Tailoring the training data to include a wide variety of clinical scenarios will enhance the model’s accuracy and reliability.
- Interactive Evaluations: Implement evaluations where LLMs can interact directly with patients, adjusting their assessments in real time. Real-time interactions can provide additional context and clarification, leading to more precise assessments.
- Sample Expansion: Include a larger number of oncologists and cases to ensure the representativeness and robustness of the results. A more extensive sample size will enhance the generalizability of the findings.
- Robustness and Generalization: Ensure healthcare AI systems demonstrate robust performance across diverse and challenging scenarios, including variations in data quality, noise, missing data, and adversarial attacks. Robustness testing should evaluate the AI system’s performance under different conditions and assess its ability to generalize to unseen data and real-world clinical settings.
5. Conclusions
5.1. Summary of Key Findings
5.2. Study Impact
5.3. Future Research Directions
5.4. Ethical Considerations
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- OpenAI. Creating Safe AGI That Benefits All of Humanity. Available online: https://openai.com/ (accessed on 28 June 2024).
- Mumtaz, U.; Ahmed, A.; Mumtaz, S. LLMs-Healthcare: Current Applications and Challenges of Large Language Models in Various Medical Specialties. arXiv 2024, arXiv:2311.12882.
- Iannantuono, G.M.; Bracken-Clarke, D.; Floudas, C.S.; Roselli, M.; Gulley, J.L.; Karzai, F. Applications of Large Language Models in Cancer Care: Current Evidence and Future Perspectives. Front. Oncol. 2023, 13, 1268915.
- Wu, D.J.; Bibault, J.E. Pilot applications of GPT-4 in radiation oncology: Summarizing patient symptom intake and targeted chatbot applications. Radiother. Oncol. 2024, 190, 109978.
- Floyd, W.; Kleber, T.; Pasli, M.; Qazi, J.J.; Huang, C.C.; Leng, J.X.; Boyer, M.J. Evaluating the Reliability of Chat-GPT Model Responses for Radiation Oncology Patient Inquiries. Int. J. Radiat. Oncol. Biol. Phys. 2023, 117, e383.
- Floyd, W.; Kleber, T.; Carpenter, D.J.; Pasli, M.; Qazi, J.; Huang, C.; Boyer, M.J. Current Strengths and Weaknesses of ChatGPT as a Resource for Radiation Oncology Patients and Providers. Int. J. Radiat. Oncol. Biol. Phys. 2024, 118, 905–915.
- Sorin, V.; Glicksberg, B.S.; Barash, Y.; Konen, E.; Nadkarni, G.; Klang, E. Applications of Large Language Models (LLMs) in Breast Cancer Care. medRxiv 2023.
- Yang, R.; Tan, T.F.; Lu, W.; Thirunavukarasu, A.J.; Ting, D.S.W.; Liu, N. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023, 2, 255–263.
- Borna, S.; Gomez-Cabello, C.A.; Pressman, S.M.; Haider, S.A.; Sehgal, A.; Leibovich, B.C.; Cole, D.; Forte, A.J. Comparative Analysis of Artificial Intelligence Virtual Assistant and Large Language Models in Post-Operative Care. Eur. J. Investig. Health Psychol. Educ. 2024, 14, 1413–1424.
- Shahbar, A.N.; Alrumaih, I.; Alzahrani, T.; Alzahrani, A.; Alanizi, A.; Alrashed, M.A.; Elrggal, M.; Alhuthali, A.; Alsuhebany, N. Advancing Cancer Care: How Artificial Intelligence is Transforming Oncology Pharmacy. Inform. Med. Unlocked 2024, in press.
- Basch, E.; Iasonos, A.; Barz, A.; Culkin, A.; Kris, M.G.; Artz, D.; Schrag, D. Long-term toxicity monitoring via electronic patient-reported outcomes in patients receiving chemotherapy. J. Clin. Oncol. 2007, 25, 5374–5380.
- Ye, J.; Hai, J.; Song, J.; Wang, Z. The role of artificial intelligence in the application of the integrated electronic health records and patient-generated health data. medRxiv 2024.
- Chen, J.; Ou, L.; Hollis, S.J. A systematic review of the impact of routine collection of patient reported outcome measures on patients, providers and health organisations in an oncologic setting. BMC Health Serv. Res. 2013, 13, 211.
- LeBlanc, T.W.; Abernethy, A.P. Patient-reported outcomes in cancer care—Hearing the patient voice at greater volume. Nat. Rev. Clin. Oncol. 2017, 14, 763–772.
- Basch, E.; Deal, A.M.; Kris, M.G.; Scher, H.I.; Hudis, C.A.; Sabbatini, P.; Schrag, D. Symptom monitoring with patient-reported outcomes during routine cancer treatment: A randomized controlled trial. J. Clin. Oncol. 2016, 34, 557.
- Basch, E.; Reeve, B.B.; Mitchell, S.A.; Clauser, S.B.; Minasian, L.M.; Dueck, A.C.; Schrag, D. Development of the National Cancer Institute’s patient-reported outcomes version of the common terminology criteria for adverse events (PRO-CTCAE). J. Natl. Cancer Inst. 2014, 106, dju244.
- Basch, E.; Artz, D.; Iasonos, A.; Speakman, J.; Shannon, K.; Lin, K.; Schrag, D. Evaluation of an online platform for cancer patient self-reporting of chemotherapy toxicities. J. Am. Med. Inform. Assoc. 2007, 14, 264–268.
- Rasschaert, M.; Vulsteke, C.; De Keersmaeker, S.; Vandenborne, K.; Dias, S.; Verschaeve, V.; Peeters, M. AMTRA: A multicentered experience of a web-based monitoring and tailored toxicity management system for cancer patients. Support. Care Cancer 2021, 29, 859–867.
- Maguire, R.; McCann, L.; Kotronoulas, G.; Kearney, N.; Ream, E.; Armes, J.; Patiraki, E.; Furlong, E.; Fox, P.; Gaiger, A.; et al. Real time remote symptom monitoring during chemotherapy for cancer: European multicentre randomised controlled trial (eSMART). BMJ 2021, 374, n1647.
- Di Maio, M.; Basch, E.; Denis, F.; Fallowfield, L.J.; Ganz, P.A.; Howell, D.; Kowalski, C.; Perrone, F.; Stover, A.M.; Sundaresan, P.; et al. The role of patient-reported outcome measures in the continuum of cancer clinical care: ESMO Clinical Practice Guideline. Ann. Oncol. 2022, 33, 878–892.
- Govindaraj, R.; Agar, M.; Currow, D.; Luckett, T. Assessing patient-reported outcomes in routine cancer clinical care using electronic administration and telehealth technologies: Realist synthesis of potential mechanisms for improving health outcomes. J. Med. Internet Res. 2023, 25, e48483.
- Van Den Hurk, C.J.; Mols, F.; Eicher, M.; Chan, R.J.; Becker, A.; Geleijnse, G.; Walraven, I.; Coolbrandt, A.; Lustberg, M.; Velikova, G.; et al. A narrative review on the collection and use of electronic patient-reported outcomes in cancer survivorship care with emphasis on symptom monitoring. Curr. Oncol. 2022, 29, 4370–4385.
- LMSYS. Chatbot Arena Leaderboard. Available online: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (accessed on 15 April 2024).
- Allen Institute for AI. ARC Leaderboard. Available online: https://leaderboard.allenai.org/arc/submissions/get-started (accessed on 29 July 2024).
- Rowan Zellers. HellaSwag. Available online: https://rowanzellers.com/hellaswag/ (accessed on 29 July 2024).
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685.
- Shah, S. Common Terminology Criteria for Adverse Events. Natl. Cancer Inst. 2022, 784, 785.
- OpenAI. Introducing GPTs. Available online: https://openai.com/index/introducing-gpts/ (accessed on 28 June 2024).
- OpenAI. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 28 June 2024).
| Toxicity | Value |
|---|---|
| Anorexia | 0.794 |
| Depression | 0.664 |
| Taste alteration | 0.637 |
| Asthenia | 0.598 |
| Hand-foot syndrome | 0.554 |
| Alopecia | 0.507 |
| Nausea | 0.503 |
| Diarrhea | 0.501 |
| Peripheral neuropathy | 0.484 |
| Mucositis | 0.473 |
| Vomiting | 0.466 |
| Insomnia | 0.446 |
| Skin alteration | 0.411 |
| Fever | 0.383 |
| Dyspnea | 0.357 |
| Hyperhidrosis | 0.323 |
| Headache | 0.288 |
| Constipation | 0.224 |
| Abdominal pain | 0.217 |
| Conjunctivitis | 0.168 |
| Hematuria | 0.137 |
| | Mode | Mean |
|---|---|---|
| Accuracy in General Categories | 81.5% | 85.7% |
| Accuracy in Specific Categories | 64.4% | 64.6% |
| Mild Errors/Severe Errors | 96%/4% | 96.4%/3.6% |
| False Alarms | 8.9% | 3% |
| | Mode | Mean |
|---|---|---|
| Accuracy in General Categories | 72.5–89.0% | 66.7–89.2% |
| Accuracy in Specific Categories | 64.2–80.0% | 57.0–76.0% |
| | Mode | Mean |
|---|---|---|
| Accuracy in General Categories | 81.3–87.9% | 81.9–86.9% |
| Accuracy in Specific Categories | 72.9–77.2% | 67.6–75.6% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ruiz Sarrias, O.; Martínez del Prado, M.P.; Sala Gonzalez, M.Á.; Azcuna Sagarduy, J.; Casado Cuesta, P.; Figaredo Berjano, C.; Galve-Calvo, E.; López de San Vicente Hernández, B.; López-Santillán, M.; Nuño Escolástico, M.; et al. Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions. Cancers 2024, 16, 2830. https://doi.org/10.3390/cancers16162830