Large Language Models for Drug-Related Adverse Events in Oncology Pharmacy: Detection, Grading, and Actioning
Abstract
1. Introduction
2. Methodology
2.1. Scope of Review (Inclusion and Exclusion)
2.2. Article Selection, Literature Search, and Review
3. Results
Overview of Selected Articles
4. Discussion
4.1. Synthesis of the Selected Articles
4.1.1. What’s Most Mature: AE Detection from Clinical Narratives
4.1.2. From Detection to Grading: CTCAE Alignment Is Feasible but Harder
4.1.3. Toward Grade-Aligned Actions: Early Signals, Limited Direct Evaluation
4.1.4. Modality-Specific Toxicity Use Cases (CAR-T, Radiotherapy, Antibody–Drug Conjugates)
4.1.5. Non-English and Cross-System Generalizability
4.1.6. Guardrails and Design Patterns That Helped
- (a)
- (b) Patient-level aggregation and temporal anchoring. Aggregating sentence- or note-level signals to the patient timeline (and constraining to clinically plausible windows) stabilized performance across sites [36].
- (c) Retrieval/citation and domain adaptation. Systems that incorporated citation-style retrieval or modest domain adaptation (e.g., small in-domain label sets) reported sizeable gains, for instance, +40% F1 with 100 annotated notes versus zero-shot [44] and improved precision for post-RT symptoms after per-symptom refinement [42].
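The patient-level aggregation and temporal anchoring pattern described above can be sketched in a few lines. This is a minimal illustration, not any study's actual pipeline: the 90-day window, function name, and data shapes are all assumptions chosen for clarity.

```python
from datetime import date, timedelta

def aggregate_patient_level(note_hits, drug_starts, window_days=90):
    """Roll note-level AE detections up to the patient level, keeping only
    detections inside a clinically plausible window after the drug start date.
    The window length is an illustrative threshold, not a validated cutoff."""
    flagged = {}
    for patient_id, note_date, ae_term in note_hits:
        start = drug_starts.get(patient_id)
        if start is None:
            continue  # no anchoring drug exposure -> cannot attribute the AE
        # Temporal anchoring: keep detections 0..window_days after drug start.
        if timedelta(0) <= (note_date - start) <= timedelta(days=window_days):
            flagged.setdefault(patient_id, set()).add(ae_term)
    return flagged

# Toy data: one plausible hit, one pre-exposure mention, one too-late mention.
hits = [
    ("p1", date(2024, 2, 1), "colitis"),     # 31 days after start -> kept
    ("p1", date(2023, 6, 1), "colitis"),     # before drug start -> dropped
    ("p2", date(2024, 8, 1), "pneumonitis"), # 213 days after start -> dropped
]
starts = {"p1": date(2024, 1, 1), "p2": date(2024, 1, 1)}
print(aggregate_patient_level(hits, starts))  # {'p1': {'colitis'}}
```

The same skeleton accommodates the note-to-patient thresholds reported by some of the reviewed systems (e.g., requiring k corroborating notes before flagging a patient) by counting hits per term instead of collecting a set.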
4.1.7. Scale, Workload, and “Fit for Use”
4.1.8. What Remains Under-Reported
5. Future Directions
5.1. Start with Surveillance and Triage
5.2. Make Outputs Speak CTCAE
5.3. Prove Recommendations with Citations
5.4. Anchor in Time and at the Patient Level
5.5. Keep Pharmacists in the Loop by Design
5.6. Tune Precision to the Intended Use
5.7. Measure Decision Impact, Not Detection Alone
5.8. Standardize Versioning and Error Taxonomies
5.9. Build Pharmacist-Centric Benchmarks
5.10. Integrate and Govern Before You Automate
5.11. Test Generalizability and Equity
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Siegel, R.D.; LeFebvre, K.B.; Temin, S.; Evers, A.; Barbarotta, L.; Bowman, R.M.; Chan, A.; Dougherty, D.W.; Ganio, M.; Hunter, B.; et al. Antineoplastic Therapy Administration Safety Standards for Adult and Pediatric Oncology: ASCO-ONS Standards. JCO Oncol. Pract. 2024, 20, 1314–1330.
- Weingart, S.N.; Zhang, L.; Sweeney, M.; Hassett, M. Chemotherapy medication errors. Lancet Oncol. 2018, 19, e191–e199.
- Nashed, A.; Zhang, S.; Chiang, C.-W.; Zitu, M.; Otterson, G.A.; Presley, C.J.; Kendra, K.; Patel, S.H.; Johns, A.; Li, M.; et al. Comparative assessment of manual chart review and ICD claims data in evaluating immunotherapy-related adverse events. Cancer Immunol. Immunother. CII 2021, 70, 2761–2769.
- Hematology/Oncology Pharmacy Association. (n.d.). HOPA Issue Brief on Hematology/Oncology Pharmacists [Issue Brief]. Available online: https://www.hoparx.org/documents/78/HOPA_About_Hem_Onc_Pharmacist_Issue_Brief_FINAL1.pdf (accessed on 10 October 2025).
- Shah, S. Common terminology criteria for adverse events. Natl. Cancer Inst. USA 2022, 784, 785.
- Lee, D.W.; Santomasso, B.D.; Locke, F.L.; Ghobadi, A.; Turtle, C.J.; Brudno, J.N.; Maus, M.V.; Park, J.H.; Mead, E.; Pavletic, S.; et al. ASTCT Consensus Grading for Cytokine Release Syndrome and Neurologic Toxicity Associated with Immune Effector Cells. Biol. Blood Marrow Transplant. J. Am. Soc. Blood Marrow Transplant. 2019, 25, 625–638.
- Chou, C.K.; Turtle, C.J. Assessment and management of cytokine release syndrome and neurotoxicity following CD19 CAR-T cell therapy. Expert Opin. Biol. Ther. 2020, 20, 653–664.
- Schneider, B.J.; Naidoo, J.; Santomasso, B.D.; Lacchetti, C.; Adkins, S.; Anadkat, M.; Atkins, M.B.; Brassil, K.J.; Caterino, J.M.; Chau, I.; et al. Management of Immune-Related Adverse Events in Patients Treated With Immune Checkpoint Inhibitor Therapy: ASCO Guideline Update. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 2021, 39, 4073–4126.
- Brahmer, J.R.; Lacchetti, C.; Schneider, B.J.; Atkins, M.B.; Brassil, K.J.; Caterino, J.M.; Chau, I.; Ernstoff, M.S.; Gardner, J.M.; Ginex, P.; et al. Management of Immune-Related Adverse Events in Patients Treated with Immune Checkpoint Inhibitor Therapy: American Society of Clinical Oncology Clinical Practice Guideline. J. Clin. Oncol. 2018, 36, 1714–1768.
- Puzanov, I.; Diab, A.; Abdallah, K.; Bingham, C.O., 3rd; Brogdon, C.; Dadu, R.; Hamad, L.; Kim, S.; Lacouture, M.E.; LeBoeuf, N.R.; et al. Managing toxicities associated with immune checkpoint inhibitors: Consensus recommendations from the Society for Immunotherapy of Cancer (SITC) Toxicity Management Working Group. J. Immunother. Cancer 2017, 5, 95.
- Chiu, C.-C.; Wu, C.-M.; Chien, T.-N.; Kao, L.-J.; Li, C.; Chu, C.-M. Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method. Int. J. Environ. Res. Public Health 2023, 20, 4340.
- Negro-Calduch, E.; Azzopardi-Muscat, N.; Krishnamurthy, R.S.; Novillo-Ortiz, D. Technological progress in electronic health record system optimization: Systematic review of systematic literature reviews. Int. J. Med. Inform. 2021, 152, 104507.
- Zitu, M.; Gatti-Mays, M.; Johnson, K.; Zhang, S.; Shendre, A.; Elsaid, M.; Li, L. Detection of Patient-Level Immunotherapy-Related Adverse Events (irAEs) from Clinical Narratives of Electronic Health Records: A High-Sensitivity Artificial Intelligence Model. Pragmatic Obs. Res. 2024, 15, 243–252.
- Gupta, S.; Belouali, A.; Shah, N.J.; Atkins, M.B.; Madhavan, S. Automated Identification of Patients With Immune-Related Adverse Events From Clinical Notes Using Word Embedding and Machine Learning. JCO Clin. Cancer Inform. 2021, 5, 541–549.
- Zitu, M.; Li, L.; Elsaid, M.I.; Gatti-Mays, M.E.; Manne, A.; Shendre, A. Comparative assessment of manual chart review and patient-level adverse drug event identification using artificial intelligence in evaluating immunotherapy-related adverse events (irAEs). J. Clin. Oncol. 2023, 41, e13583.
- Iannantuono, G.M.; Bracken-Clarke, D.; Floudas, C.S.; Roselli, M.; Gulley, J.L.; Karzai, F. Applications of large language models in cancer care: Current evidence and future perspectives. Front. Oncol. 2023, 13, 1268915.
- Zhu, L.; Mou, W.; Chen, R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J. Transl. Med. 2023, 21, 269.
- Sorin, V.; Klang, E.; Sklair-Levy, M.; Cohen, I.; Zippel, D.B.; Lahat, N.B.; Konen, E.; Barash, Y. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 2023, 9, 44.
- Zitu, M.; Le, T.D.; Duong, T.; Haddadan, S.; Garcia, M.; Amorrortu, R.; Zhao, Y.; Rollison, D.E.; Thieu, T. Large language models in cancer: Potentials, risks, and safeguards. BJR Artif. Intell. 2024, 2, ubae019.
- Shah, S.V. Accuracy, consistency, and hallucination of large language models when analyzing unstructured clinical notes in electronic medical records. JAMA Netw. Open 2024, 7, e2425953.
- Tozuka, R.; Johno, H.; Amakawa, A.; Sato, J.; Muto, M.; Seki, S.; Komaba, A.; Onishi, H. Application of NotebookLM, a large language model with retrieval-augmented generation, for lung cancer staging. Jpn. J. Radiol. 2024, 43, 706–712.
- Meskó, B.; Topol, E.J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit. Med. 2023, 6, 120.
- Neuss, M.N.; Gilmore, T.R.; Belderson, K.M.; Billett, A.L.; Conti-Kalchik, T.; Harvey, B.E.; Hendricks, C.; LeFebvre, K.B.; Mangu, P.B.; McNiff, K.; et al. 2016 Updated American Society of Clinical Oncology/Oncology Nursing Society Chemotherapy Administration Safety Standards, Including Standards for Pediatric Oncology. J. Oncol. Pract. 2016, 12, 1262–1271.
- Mackler, E.; Segal, E.M.; Muluneh, B.; Jeffers, K.; Carmichael, J. 2018 Hematology/Oncology Pharmacist Association Best Practices for the Management of Oral Oncolytic Therapy: Pharmacy Practice Standard. J. Oncol. Pract. 2019, 15, e346–e355.
- Wat, S.K.; Wesolowski, B.; Cierniak, K.; Roberts, P. Assessing the impact of an electronic chemotherapy order verification checklist on pharmacist reported errors in oncology infusion centers of a health-system. J. Oncol. Pharm. Pract. Off. Publ. Int. Soc. Oncol. Pharm. Pract. 2023, 31, 65–71.
- Ranchon, F.; Salles, G.; Späth, H.-M.; Schwiertz, V.; Vantard, N.; Parat, S.; Broussais, F.; You, B.; Tartas, S.; Souquet, P.J.; et al. Chemotherapeutic errors in hospitalised cancer patients: Attributable damage and extra costs. BMC Cancer 2011, 11, 478.
- Schlichtig, K.; Dürr, P.; Dörje, F.; Fromm, M.F. Medication Errors During Treatment with New Oral Anticancer Agents: Consequences for Clinical Practice Based on the AMBORA Study. Clin. Pharmacol. Ther. 2021, 110, 1075–1086.
- Fentie, A.M.; Huluka, S.A.; Gebremariam, G.T.; Gebretekle, G.B.; Abebe, E.; Fenta, T.G. Impact of pharmacist-led interventions on medication-related problems among patients treated for cancer: A systematic review and meta-analysis of randomized control trials. Res. Soc. Adm. Pharm. RSAP 2024, 20, 487–497.
- Pennisi, M.; Jain, T.; Santomasso, B.D.; Mead, E.; Wudhikarn, K.; Silverberg, M.L.; Batlevi, Y.; Shouval, R.; Devlin, S.M.; Batlevi, C.; et al. Comparing CAR T-cell toxicity grading systems: Application of the ASTCT grading system and implications for management. Blood Adv. 2020, 4, 676–686.
- Zitu, M.; Owen, D.; Manne, A.; Wei, P.; Li, L. Large Language Models for Adverse Drug Events: A Clinical Perspective. J. Clin. Med. 2025, 14, 5490.
- Baumeister, R.F.; Leary, M.R. Writing narrative literature reviews. Rev. Gen. Psychol. 1997, 1, 311–320.
- Zhu, M.; Lin, H.; Jiang, J.; Jinia, A.J.; Jee, J.; Pichotta, K.; Waters, M.; Rose, D.; Schultz, N.; Chalise, S.; et al. Large language model trained on clinical oncology data predicts cancer progression. NPJ Digit. Med. 2025, 8, 397.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. Available online: https://arxiv.org/abs/1810.04805 (accessed on 10 October 2025).
- Bakouny, Z.; Ahmed, N.; Fong, C.; Rahman, A.; Perea-Chamblee, T.; Pichotta, K.; Waters, M.; Fu, C.; Jeng, M.Y.; Lee, M.; et al. Use of a large language model (LLM) for pan-cancer automated detection of anti-cancer therapy toxicities and translational toxicity research. J. Clin. Oncol. 2025, 43, 1558.
- Barman, H.; Venkateswaran, S.; Del Santo, A.; Yoo, U.; Silvert, E.; Rao, K.; Raghunathan, B.; Kottschade, L.A.; Block, M.S.; Chandler, G.S.; et al. Identification and characterization of immune checkpoint inhibitor-induced toxicities from electronic health records using natural language processing. JCO Clin. Cancer Inform. 2024, 8, e2300151.
- Bejan, C.A.; Wang, M.; Venkateswaran, S.; Bergmann, E.A.; Hiles, L.; Xu, Y.; Chandler, G.S.; Brondfield, S.; Silverstein, J.; Wright, F.; et al. irAE-GPT: Leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets. medRxiv 2025, preprint.
- Burnette, H.; Pabani, A.; von Itzstein, M.S.; Switzer, B.; Fan, R.; Ye, F.; Puzanov, I.; Naidoo, J.; Ascierto, P.A.; Gerber, D.E.; et al. Use of artificial intelligence chatbots in clinical management of immune-related adverse events. J. Immunother. Cancer 2024, 12, e008599.
- Ruiz Sarrias, O.; Martínez Del Prado, M.P.; Sala Gonzalez, M.Á.; Azcuna Sagarduy, J.; Casado Cuesta, P.; Figaredo Berjano, C.; Galve-Calvo, E.; López de San Vicente Hernández, B.; López-Santillán, M.; Nuño Escolástico, M.; et al. Leveraging large language models for precision monitoring of chemotherapy-induced toxicities: A pilot study with expert comparisons and future directions. Cancers 2024, 16, 2830.
- Chen, S.; Guevara, M.; Ramirez, N.; Murray, A.; Warner, J.L.; Aerts, H.J.W.L.; Miller, T.A.; Savova, G.K.; Mak, R.H.; Bitterman, D.S. Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy. JCO Clin. Cancer Inform. 2023, 7, e2300048.
- Chumsri, S.; O’Sullivan, C.; Goldberg, H.; Venkateswaran, S.; Silvert, E.; Wagner, T.; Poppe, R.; Genevray, M.; Mohindra, R.; Sanglier, T. Utilizing natural language processing to examine adverse events in HER2+ breast cancer patients. ESMO Open 2025, 10, 105040.
- Geevarghese, R.; Solomon, S.B.; Alexander, E.S.; Marinelli, B.; Chatterjee, S.; Jain, P.; Cadley, J.; Hollingsworth, A.; Chatterjee, A.; Ziv, E. Utility of a large language model for extraction of clinical findings from healthcare data following lung ablation: A feasibility study. J. Vasc. Interv. Radiol. 2024, 36, 704–708.
- Ghanem, A.I.; Khanmohammadi, R.; Verdecchia, K.; Hall, R.; Elshaikh, M.A.; Movsas, B.; Bagher-Ebadian, H.; Chetty, I.; Ghassemi, M.M.; Thind, K. Late Radiotherapy-related Toxicity Extraction From Clinical Notes Using Large Language Models for Definitively Treated Prostate Cancer Patients, 107th Annual Meeting of the American Radium Society, ARS 2025. Am. J. Clin. Oncol. 2025, 48, S1–S39.
- Guillot, J.; Miao, B.; Suresh, A.; Williams, C.; Zack, T.; Wolf, J.L.; Butte, A. Constructing adverse event timelines for patients receiving CAR-T therapy using large language models. J. Clin. Oncol. 2024, 42 (Suppl. S16), 2555.
- Herman Bernardim Andrade, G.; Nishiyama, T.; Fujimaki, T.; Yada, S.; Wakamiya, S.; Takagi, M.; Kato, M.; Miyashiro, I.; Aramaki, E. Assessing domain adaptation in adverse drug event extraction on real-world breast cancer records. Int. J. Med. Inform. 2024, 191, 105539.
- Hundal, J.; Teplinsky, E. Results of the COMPARE-GPT study: Comparison of medication package inserts and GPT-4 cancer drug information. J. Clin. Oncol. 2024, 42, e13646.
- Tsuchiya, M.; Kawazoe, Y.; Shimamoto, K.; Seki, T.; Imai, S.; Kizaki, H.; Shinohara, E.; Yada, S.; Wakamiya, S.; Aramaki, E.; et al. Elucidating Celecoxib’s Preventive Effect in Capecitabine-Induced Hand-Foot Syndrome Using Medical Natural Language Processing. JCO Clin. Cancer Inform. 2025, 9, e2500096.
- Tsuchiya, M.; Shimamoto, K.; Kawazoe, Y.; Shinohara, E.; Yada, S.; Wakamiya, S.; Imai, S.; Kizaki, H.; Hori, S.; Aramaki, E. Natural Language Processing-Based Approach to Detect Common Adverse Events of Anticancer Agents from Unstructured Clinical Notes: A Time-to-Event Analysis. Stud. Health Technol. Inform. 2025, 329, 703–707.
- Vienne, R.; Filori, Q.; Susplugas, V.; Crochet, H.; Verlingue, L. Prediction of nausea or vomiting, and fatigue or malaise in cancer care. Cancer Res. 2024, 84, 3475.
- Yanagisawa, Y.; Watabe, S.; Yokoyama, S.; Sayama, K.; Kizaki, H.; Tsuchiya, M.; Imai, S.; Someya, M.; Taniguchi, R.; Yada, S.; et al. Identifying Adverse Events in Outpatients With Prostate Cancer Using Pharmaceutical Care Records in Community Pharmacies: Application of Named Entity Recognition. JMIR Cancer 2025, 11, e69663.
- Zitu, M.M.; Zhang, S.; Owen, D.H.; Chiang, C.; Li, L. Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records. Front. Pharmacol. 2023, 14, 1218679.
- Block, M.S.; Barman, H.; Venkateswaran, S.; Del Santo, A.G.; Yoo, U.; Silvert, E.; Chandler, G.S.; Wagner, T.; Mohindra, R. The role of natural language processing techniques versus conventional methods to gain ICI safety insights from unstructured EHR data. JCO Glob. Oncol. 2023, 9, 136.
- Sun, V.H.; Heemelaar, J.C.; Hadzic, I.; Raghu, V.K.; Wu, C.-Y.; Zubiri, L.; Ghamari, A.; LeBoeuf, N.R.; Abu-Shawer, O.; Kehl, K.L.; et al. Enhancing Precision in Detecting Severe Immune-Related Adverse Events: Comparative Analysis of Large Language Models and International Classification of Disease Codes in Patient Records. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 2024, 42, 4134–4144.
Abbreviations
| Term | Definition/Full Meaning | Term | Definition/Full Meaning |
|---|---|---|---|
| ADC | Antibody–drug conjugate | ADE | adverse drug event |
| AE(s) | Adverse events | AI | artificial intelligence |
| ARAT | androgen receptor-axis-targeted agents | ASCO | American Society of Clinical Oncology |
| ASTCT | American Society for Transplantation and Cellular Therapy | AUPRC | area under the precision–recall curve |
| BC | breast cancer | BERT | Bidirectional Encoder Representations from Transformers |
| BERTopic | topic modeling algorithm | BWH | Brigham and Women’s Hospital |
| CAR-T | chimeric antigen receptor T-cell | CI | confidence interval |
| Cox | Cox proportional hazards (model) | CPOE | Computerized provider order entry |
| CRPC | castration-resistant prostate cancer | CRS | Cytokine release syndrome |
| CTCAE | Common Terminology Criteria for Adverse Events | ECOV | Electronic chemotherapy order verification |
| EHR | Electronic health record | eMAR | Electronic medication administration record |
| F1 (F-measure) | harmonic mean of precision and recall | FDA | Food and Drug Administration |
| FN | false negative | FP | false positive |
| GPT | generative pre-trained transformer | GPT-4 | Generative Pre-trained Transformer 4 |
| HER2+ | human epidermal growth factor receptor 2-positive | HFS | hand-foot syndrome |
| HIPAA | Health Insurance Portability and Accountability Act | HOPA | Hematology/Oncology Pharmacy Association |
| HR | hazard ratio | ICANS | Immune effector cell–associated neurotoxicity syndrome |
| ICD | International Classification of Diseases | ICD-10 | International Classification of Diseases, 10th Revision |
| ICI | Immune checkpoint inhibitor | ILD | interstitial lung disease |
| IR | interventional radiology | irAE | Immune-related adverse events |
| JSON | JavaScript Object Notation | Kendall W | Kendall’s coefficient of concordance |
| LLM(s) | Large language models | LLT | Low Level Term |
| MedDRA | Medical Dictionary for Regulatory Activities | medspaCy | medical spaCy (clinical NLP toolkit) |
| MEPS | myocarditis, encephalitis, pneumonitis, and severe cutaneous adverse reactions | MGH | Massachusetts General Hospital |
| MoA | mechanism of action | MWA | microwave ablation |
| NCCN | National Comprehensive Cancer Network | NER | named entity recognition |
| NLP | Natural language processing | NPV | negative predictive value |
| OICI | Osaka International Cancer Institute | OncoBERT | oncology-domain BERT model |
| ONS | Oncology Nursing Society | p | p-value |
| PPV | Positive Predictive Value | Proximal management signal | an indirect marker of AE management (e.g., steroid initiation) when direct grade-to-action measurement isn’t available |
| PSM | propensity score matching | pt/pts | patients |
| PT | Preferred Term(s) | PubMedBERT | BERT pretrained on PubMed biomedical text |
| Q&A | question and answer | R11 | nausea and vomiting |
| R53 | malaise and fatigue | RAG | Retrieval-augmented generation |
| RT | radiotherapy | RW | real-world |
| RWE | Real-world evidence | SCAR | severe cutaneous adverse reaction |
| SITC | Society for Immunotherapy of Cancer | SOAP | Subjective, Objective, Assessment, Plan |
| T-DM1 | trastuzumab emtansine | T-DXd | trastuzumab deruxtecan |
| TI/AB | Title/Abstract (database search fields) | TN | true negative |
| TP | true positive | UCSF | University of California, San Francisco |
| Version pinning | fixing versions of CTCAE, prompts, model weights, and retrieval sources to ensure traceability and reproducibility | VUMC | Vanderbilt University Medical Center |
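As a concrete illustration of the version pinning entry above, a deployment might emit a run manifest alongside each batch of model outputs. This is a hypothetical sketch: every field name and value is an assumed example, not a published schema.

```python
import hashlib
import json

# The exact prompt template used for a run; hashing it lets auditors verify
# that later outputs were produced by the same prompt. (Illustrative text.)
PROMPT_TEMPLATE = "Grade the following adverse event per CTCAE v5.0: {note}"

# Illustrative run manifest pinning CTCAE version, model snapshot, prompt,
# and retrieval corpus for traceability and reproducibility.
manifest = {
    "ctcae_version": "5.0",
    "model_snapshot": "example-llm-2025-01",  # hypothetical pinned identifier
    "prompt_sha256": hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest(),
    "retrieval_corpus": "institutional-guidelines-2024-06",  # hypothetical tag
}
print(json.dumps(manifest, indent=2))
```

Storing such a manifest with each output batch makes it possible to say, months later, exactly which CTCAE version, prompt, and model produced a given grade.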
| Study Objective | Clinical Data Source and Size | LLM Approach | Metric and Performance | Clinical Impact/Advancements | Author and Year |
|---|---|---|---|---|---|
| Use Llama-3.1 to detect five therapy-related AEs from clinical notes and predict time-to-AE, validated against prospectively collected trial AE data. | Prospective AE gold standard from 1754 solid-tumor pts across 675 trials; note-level subset of 100 curated notes. | Llama 3.1 LLM AE annotation; CTCAE v5.0 definitions; note- and patient-level evaluation; Pearson R2 for time-to-AE. | Adrenal insufficiency: note sens/spec 100/97.8; patient 97.7/94.7; R2 98.2. Colitis: 66.7/99.0; 94.3/80.4; 89.2. Hyperthyroidism: 57.1/100; 74.0/91.4; 98.7. Hypothyroidism: 100/88.9; 88.1/74.0; 96.1. Pneumonitis: 76.9/97.7; 98.6/70.1; 83.9. | Demonstrates accurate, scalable LLM-based AE surveillance across cancers; enables translational toxicity research using routinely documented clinical text. | Bakouny et al., 2025 [34] |
| Identify ICI-induced irAEs from EHR text using LLM and assess patient impact (steroids, discontinuation) versus diagnosis codes. | Mayo Clinic EHR (2005–2021): unstructured notes + structured orders/codes; 9290 ICI-treated adults; manual-review MEPS cohort; comparators: diagnosis codes, medication orders. | Augmented curation using fine-tuned SciBERT sentence classifier (drug-to-phenotype) plus medication-administered classifier; synonym libraries; ensemble over phrase fragments to infer ICI → irAE causality and steroids. | Drug-to-phenotype acc 84.8%, precision/recall >85%; medication-administered acc 89%, precision/recall >88%; MEPS vs. manual: spec 0.858, sens 0.903, PPV 0.792, NPV 0.937, F1 0.844; steroid use 82%. | Outperformed diagnosis codes for causal irAE detection; quantified severity actions (steroids, discontinuation); enables scalable real-world pharmacovigilance and safety profiling for ICIs. | Barman et al., 2024 [35] |
| Assess GPT-3.5/4/4o for identifying immune-related adverse events from EHR notes and clinical trial narratives across institutions. | VUMC EHR (100 pts; 26,432 notes), UCSF EHR (70 pts; 487 notes), 7 Roche trials (272 pts; narratives); comparator: manual irAE annotations/MedDRA database. | Zero-shot GPT-3.5/4/4o (Azure OpenAI); prompt with irAE lists/synsets; JSON outputs; patient-level aggregation thresholds; organ-level mapping. | Patient-level micro-F1: VUMC 0.556–0.589; UCSF 0.581–0.662; Roche 0.535–0.620. Note-level (VUMC) irAE micro-F1 0.496–0.571; category micro-F1 0.611–0.656. | Generalizable LLM irAE phenotyping across EHRs and trials; scalable adverse-event surveillance from text; reduces manual review burden and supports pharmacoepidemiologic safety monitoring. | Bejan et al., 2025 [36] |
| Evaluate chatbot accuracy/completeness for irAE management across organ systems and scenarios. | 50 guideline-derived questions (10 irAE categories) + 20 patient-specific scenarios; 8 expert raters; comparator: ASCO/SITC/NCCN guidelines. | ChatGPT (GPT-4) and Google Bard; zero-shot Q&A; expert Likert ratings (1–4) for accuracy/completeness. | ChatGPT: mean accuracy 3.87, completeness 3.83; Bard: 3.50, 3.46; p < 0.001 overall. Ratings of 1: 0.3% vs. 1.1%. Scenarios: accuracy 3.73, completeness 3.61; Kendall W ~0.21–0.27. | Shows GPT-4 provides largely accurate, guideline-consistent irAE management guidance; supports clinician use with verification and highlights category-specific gaps. | Burnette et al., 2024 [37] |
| Evaluate GPT-4 for classifying chemotherapy-induced subjective toxicities vs. oncologists using CTCAE v5. | 30 fictitious audio cases (transcribed); 13 oncologists; comparator: oncologist mode/mean CTCAE grades. | Contextualized GPT-4 with CTCAE v5 reference; custom GPT; 10 runs/case; outputs general (none/mild/severe) and specific grades 0–4. | General-category accuracy 81.5–85.7%; specific-category accuracy 64.4–64.6%; mild errors 96%, severe errors 3.6–4%; false alarms 3–8.9%. | Demonstrates near-expert general severity classification from narratives; feasible LLM toxicity monitoring; needs improvement for fine-grained grading; basis for real-patient validation. | Ruiz Sarrias et al., 2024 [38] |
| Classify CTCAE esophagitis severity from radiotherapy clinic notes. | Gold: 1524 notes/124 lung pts; Silver: 2420 notes/1832 pts; External: 345 notes/75 esophageal pts; comparator: manual CTCAE labels & structured toxicity grades. | Encoder-only PubMedBERT fine-tuned; medspaCy sectionizer; CTCAE v5.0 mapping; silver-label augmentation; note → patient aggregation. | Note macro-F1: 0.92 (Task1), 0.82 (Task2), 0.74 (Task3); external 0.73/0.74/0.65. Patient macro-F1: 1.00/0.92/0.49 (lung) and 0.70/0.69/0.58 (esophageal). | First to automatically extract CTCAE severity from notes; supplements sparse ICD-10 coding; enables scalable radiotherapy (RT) toxicity surveillance and RWE. | Chen et al., 2023 [39] |
| Use BERT-based NLP plus clinical review to identify ADC-related AEs in HER2+ breast cancer. | Mayo Clinic EHR notes; T-DXd n = 266, T-DM1 n = 432; manual adjudication for ILD/colitis causality. | Fine-tuned BERT classifier on time-stamped notes with synonym lists; clinician adjudication of drug → AE causality. | Adjudicated ILD = 16 (14 T-DXd, 2 T-DM1); steroids 15/16 (94%); hospitalized 9/16 (56%); fatalities 3/16 (19%); colitis 0 attributable. | Demonstrates scalable AE surveillance from EHR text for ADCs; quantifies real-world ILD burden and management outcomes to support pharmacovigilance. | Chumsri et al., 2025 [40] |
| Assess GPT-3.5 feasibility for extracting post-ablation complications and local recurrence from oncology reports/notes. | Single-center EHR: 20 lung MWA patients; 104 radiology reports + 37 clinic notes; comparator: manual chart-review ground truth. | GPT-3.5 Turbo 16 k (Azure HIPAA) zero-shot; temperature 0; prompt for four binary outcomes (recurrence, pneumothorax, hemoptysis, hemothorax); Python pipeline. | Recurrence: acc 85%, recall 100%, precision 77%, F1 0.87. Pneumothorax: acc 85%, recall 67%, precision 100%, F1 0.80. Hemoptysis acc 95% (1 FP). Hemothorax acc 100%. | Demonstrates LLM-based extraction of oncology IR outcome AEs from routine text; supports faster registry curation and safety surveillance. | Geevarghese et al., 2025 [41] |
| Automate extraction of late RT-related toxicities from prostate cancer clinical notes using an LLM. | Single-center EHR; 177 pts; 1133 notes (>6 mo post-RT); 699 notes for optimization; 434-note validation; 12 GU/GI domains. | Teacher–student: Mixtral-8x7B extracts symptoms; GPT-4 refines prompts over 16 rounds/5 epochs; outputs positive/negative/neutral with rationale. | Single-symptom notes: acc 0.71, precision 0.82, recall 0.71, F1 0.73; multi-symptom best acc: hematuria 0.76, UC 0.70; per-symptom postrefinement acc 72–97%, overall ~84%. | Shows feasible LLM-based surveillance of late RT toxicities; iterative teacher–student refinement improves extraction; supports longitudinal toxicity monitoring in prostate cancer. | Ghanem et al., 2025 [42] |
| Evaluate GPT-4 for extracting CAR-T adverse events and building AE timelines from progress notes. | UCSF deidentified EHR; 4183 notes within 30 days post-CAR-T from 253 patients; ~10% (25 pts) manually validated by clinician. | GPT-4 (Azure HIPAA) zero-shot AE extraction for events prompting clinical intervention; BERTopic clustering to identify temporal patterns. | Manual validation accuracy 64%; 19 AE clusters; largest cluster (hyponatremia/leukocytosis/encephalopathy/neurologic) occurred mean 12.9 days post-CAR-T. | Shows feasibility of LLM-based AE surveillance and timeline construction after CAR-T, potentially reducing manual chart-review burden. | Guillot et al., 2024 [43] |
| Assess domain adaptation for ADE extraction from breast cancer EHR notes. | OICI pharmacy notes (Japanese): 1928 notes from 434 breast cancer patients (2019–2021); 1000 notes annotated; plus 1027 dummy case reports; comparator: held-out annotated test set. | Encoder-only BERT (cl-tohoku/bert-base-japanese-char-v2) fine-tuned for NER; domain adaptation with 100/500/800 in-domain notes; MedDRA normalization via Levenshtein matching. | Baseline F1 = 0.51; +40% F1 with 100 notes; best with 800 notes F1 = 0.84; 10,569 mentions normalized to 343 MedDRA PTs; regimen-wise ADE frequencies tabulated. | Demonstrates LLM-based ADE detection in Japanese EHRs; domain adaptation markedly improves performance; enables regimen-specific ADE surveillance to support pharmacovigilance. | Andrade et al., 2024 [44] |
| Compare GPT-4 oncology drug safety information with FDA labels. | 53 solid-tumor drugs with 2020–2022 approvals; comparator: FDA package inserts; two-physician review. | GPT-4 Q&A for four items; safety items: common AEs ≥20% and warnings/precautions; repeat-query check for discordances. | Indications/MoA acc = 100%; AEs correct = 53%, incorrect = 47%; warnings/precautions correct = 32%, incorrect = 68% (36/53); repeated queries changed AE answers 76% and warnings 53%. | Reveals omissions/variability in GPT-4 safety outputs; cautions against relying on GPT as a primary oncology drug safety source. | Hundal et al., 2024 [45] |
| Detect capecitabine-induced HFS from EHR notes with BERT-based NER and evaluate celecoxib’s preventive effect. | University of Tokyo Hospital EHR (2004–2021): 44,502 cancer pts; 669 capecitabine users; PSM 636 vs. 636; celecoxib 31 vs. 31; manual validation 2606 notes/62 pts; comparator: manual annotations. | MedNERN-CR-JA (Japanese BERT) NER with ADR normalization; positive-attribution filtering; patient-level aggregation; propensity score matching; Cox models for HFS risk. | Patient-level: precision 0.875, recall 1.000, F1 0.933; document-level: precision 0.920, recall 0.857, F1 0.888; capecitabine vs. none HR 1.93 (95% CI 1.48–2.52); celecoxib HR 0.51 (0.24–1.07). | Demonstrates reliable LLM-based detection of symptomatic AE (HFS) from real-world Japanese EHR and enables retrospective safety evaluation suggesting celecoxib’s protective trend. | Tsuchiya et al., 2025 [46] |
| Detect symptomatic AEs from oncology notes and estimate drug-associated time-to-event risk. | University of Tokyo Hospital EHR (progress, discharge, nursing notes) 2004–2021; N = 39,319 cancer patients; PSM comparator cohorts. | MedNERN-CR-JA (Japanese BERT) NER with AE dictionary; onset = first symptom after drug; 1:1 PSM; Cox models. | Significant HRs: anthracyclines → cardiotoxicity 1.21; oxaliplatin → peripheral neuropathy 2.56; capecitabine → HFS 1.93; detectability varies by note source (Figure 2). | Validates large-scale LLM-based AE surveillance and time-to-onset analysis from notes, informing pharmacovigilance beyond codes/labs and emphasizing data-source choice. | Tsuchiya et al., 2025 [47] |
| Predict nausea/vomiting and fatigue/malaise events from oncology clinical notes. | Centre Léon Bérard EHR (France); 140,523 patients; 2,515,957 notes; labels: ICD-10 events within 90 days of notes. | OncoBERT (BERT, French) pretrained on local notes; time-encoding of sequential reports; fine-tuned vs. DrBert/K-memBERT; random undersampling. | Macro-AUPRC: 0.58 (OncoBERT) vs. 0.50 (open-source baseline) for R11/R53 prediction. | Demonstrates large-scale LLM prediction of common oncologic AEs from notes to enable early alerts and reduce AE-related hospitalizations. | Vienne et al., 2024 [48] |
| Identify adverse events from community-pharmacy records of ARAT-treated prostate cancer outpatients using BERT NER. | Nakajima Pharmacy Group records (Japan, 2020–2021); Step1: 1008 annotated notes; Step2: 161 ARAT patients, 2193 assessment notes; comparator: manual annotation. | MedNER-CR-JA (Japanese BERT) NER with factuality tags (positive/suspicious/negative/general); preprocessing; Fleiss κ = 0.62; evaluation by exact/position matches. | Exact-match F1 = 0.72 (positive 0.70; negative 0.78); position-match F1 = 0.86; 1900 symptom tags from 2193 notes (32.8% positive). | Enables scalable AE surveillance from community-pharmacy narratives, capturing drug-specific ARAT AE profiles and documenting absence of severe AEs for outpatient safety monitoring. | Yanagisawa et al., 2025 [49] |
| Test LLM generalizability for drug–ADE relation extraction across datasets. | The Ohio State University (OSU) ICI notes: 1394 notes/47 pts; 189 relations. n2c2: 505 summaries; 1355 relations; gold standards = manual annotations. | Encoder LLMs (BERT, ClinicalBERT) on sentence text between drug and ADE; compared vs. SVM/CNN/BiLSTM. | ClinicalBERT F-score inter-dataset: 0.78 (OSU → n2c2), 0.74 (n2c2 → OSU); intra-dataset best on n2c2: 0.87. | Shows superior cross-site ADE detection from notes with ClinicalBERT, supporting scalable oncology pharmacovigilance. | Zitu et al., 2023 [50] |
| Compare NLP/LLM Augmented Curation vs. ICD/manual review for detecting ICI irAEs and associated actions. | Mayo Clinic EHR; ~9000 ICI-treated pts; gold-standard manual A.C.H. cohort; 540 patients with A.C.H.; comparator: ICD codes. | SciBERT-based Augmented Curation with sentence extraction/entity classification for drug–AE causality; assesses steroid/second-line (2L) immunosuppressant use and ICI discontinuation; time-to-curate benchmark. | F1 = 0.85 for drug–AE; higher sensitivity/PPV/NPV than ICD; processing ~10 min for 9000 pts vs. ~9 weeks manual. | Quantifies irAE management actions (steroids 79%, 2L immunosuppressant 6.1%, ICI discontinuation 7.7%); demonstrates scalable, accurate safety surveillance from unstructured notes. | Block et al., 2023 [51] |
| Evaluate an LLM vs. ICD codes for detecting severe ICI-related irAEs from EHR text. | MGH inpatient EHR: 7555 admissions (2011–2023); external validation BWH: 1270 admissions (2018–2019); comparator: manual adjudication and ICD codes. | Open-source Mistral OpenOrca with retrieval-augmented generation; four irAE prompts; vector database; zero-shot yes/no outputs; ~9.53 s/chart. | MGH: mean sens 94.7%, spec 93.7%, PPV 15.2%, NPV 99.9%; significant gains vs. ICD for hepatitis, myocarditis, pneumonitis. Validation: sens 98.1%, spec 95.7%; adjusted sens/spec 99.2%/97.6% (model-detected). | Outperforms ICD coding and accelerates adjudication; generalizes across institutions; identifies missed irAEs, enabling scalable safety surveillance. | Sun et al., 2024 [52] |
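Several of the studies summarized above (e.g., Tsuchiya et al. [46,47]) share a common surveillance pattern: note-level NER detections with factuality tags are aggregated to a single patient-level onset date (the first positively attributed mention after drug start, within a clinically plausible window) before any time-to-event modeling. A minimal sketch of that aggregation step follows; the tuple shapes and the 180-day window are illustrative assumptions, not the published implementations, and the actual pipelines use MedNERN-CR-JA outputs and feed the resulting time-to-onset into propensity-score-matched Cox models.

```python
from datetime import date, timedelta

def first_onset(drug_start, detections, window_days=180):
    """Earliest positively attributed AE mention after drug start,
    restricted to a clinically plausible window; None if absent.

    detections: iterable of (note_date, factuality) pairs, where
    factuality is one of "positive", "suspicious", "negative", "general"
    (mirroring the factuality tags used in the Japanese NER studies).
    """
    window_end = drug_start + timedelta(days=window_days)
    candidates = [d for d, tag in detections
                  if tag == "positive" and drug_start <= d <= window_end]
    return min(candidates) if candidates else None

# Illustrative example: capecitabine started 2021-01-10;
# note-level NER detections of hand-foot syndrome (HFS).
notes = [
    (date(2020, 12, 1), "general"),    # pre-treatment mention, ignored
    (date(2021, 3, 5), "suspicious"),  # hedged mention, ignored
    (date(2021, 4, 2), "positive"),    # first confirmed mention -> onset
    (date(2021, 5, 20), "positive"),   # later mention, not the onset
]
onset = first_onset(date(2021, 1, 10), notes)
time_to_onset = (onset - date(2021, 1, 10)).days  # 82 days
```

Constraining onset to a plausible post-exposure window is what Section 4.1.6 describes as temporal anchoring; it suppresses pre-treatment and incidentally co-occurring mentions that would otherwise inflate apparent AE incidence at the patient level.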
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zitu, M.M.; Manne, A.; Zhu, Y.; Rahat, W.B.; Binkheder, S. Large Language Models for Drug-Related Adverse Events in Oncology Pharmacy: Detection, Grading, and Actioning. Pharmacy 2025, 13, 176. https://doi.org/10.3390/pharmacy13060176

