Review

Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions

Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(10), 894; https://doi.org/10.3390/info16100894
Submission received: 17 September 2025 / Revised: 10 October 2025 / Accepted: 12 October 2025 / Published: 14 October 2025

Abstract

Large language model (LLM)-based agents are rapidly emerging as transformative tools across biomedical research and clinical applications. By integrating reasoning, planning, memory, and tool-use capabilities, these agents go beyond static language models to operate autonomously or collaboratively within complex healthcare settings. This review provides a comprehensive survey of biomedical LLM agents, spanning their core system architectures, enabling methodologies, and real-world use cases such as clinical decision making, biomedical research automation, and patient simulation. We further examine emerging benchmarks designed to evaluate agent performance under dynamic, interactive, and multimodal conditions. In addition, we systematically analyze key challenges, including hallucinations, interpretability, tool reliability, data bias, and regulatory gaps, and discuss corresponding mitigation strategies. Finally, we outline future directions in areas such as continual learning, federated adaptation, robust multi-agent coordination, and human–AI collaboration. This review aims to establish a foundational understanding of biomedical LLM agents and provide a forward-looking roadmap for building trustworthy, reliable, and clinically deployable intelligent systems.

1. Introduction

Large language models (LLMs), built on Transformer architectures and pre-trained on massive text corpora [1], have achieved remarkable success in natural language understanding and generation tasks [2]. In recent years, their role has evolved beyond static text generation to encompass interactive reasoning, planning, and tool use, the core capabilities that characterize intelligent agentic systems [3,4]. This shift has led to the emergence of LLM-driven agents, capable of acting autonomously or semi-autonomously to complete complex, multi-step tasks.
In the biomedical domain, this transformation is particularly promising. LLM agents are now being explored for applications such as clinical decision support, scientific literature analysis, drug discovery, and workflow automation [2,4]. Unlike traditional models that simply output responses, biomedical LLM agents operate with enhanced autonomy: they can invoke external tools (e.g., medical calculators, retrieval APIs), maintain contextual memory across interactions, and execute goal-directed behaviors in high-stakes environments [3,4]. For example, agents like SourceCheckup [5] and CalcQA [6] have demonstrated capabilities in source-grounded reasoning and numerical tool usage, while MedJourney [7] and AgentClinic [8] explore collaborative and multi-agent workflows in virtual clinical scenarios.
We define the biomedical LLM agent as an artificial intelligence system that leverages a large language model as its core cognitive engine, and is equipped with perception, planning, memory, and actuation modules to solve complex problems within biomedical and clinical contexts [3,9,10,11]. These agents may support or automate processes such as diagnosis, literature synthesis, physician–patient interaction, and trial matching, and are increasingly viewed as foundational building blocks for next-generation clinical AI systems.
However, the deployment of such agents in the biomedical domain introduces a unique set of challenges. Given the high-stakes nature of medicine, hallucinations, i.e., the generation of plausible but inaccurate or fabricated information, pose substantial risks [12]. Furthermore, biomedical AI systems must meet stringent requirements for data privacy (e.g., HIPAA, GDPR), algorithmic fairness, and interpretability, all while conforming to evolving regulatory frameworks such as FDA guidelines and the EU AI Act [13]. These domain-specific constraints necessitate new design principles, evaluation metrics, and deployment safeguards that go well beyond those used in general-purpose NLP systems.
Despite these advances, there remains a lack of comprehensive surveys examining LLM agents in biomedical and clinical settings (systems with agentic reasoning, memory, tool orchestration, and planning). This paper addresses this gap by providing a focused review of the methods, evaluations, and deployment issues associated with biomedical LLM agents.
Compared with prior surveys focusing on general medical LLMs or conversational agents [2,12], our review is distinguished by its agent-centric framing. Rather than treating biomedical LLMs as static models, we conceptualize them as agentic systems with capabilities in reasoning, planning, memory, and tool orchestration. We introduce a multi-layer taxonomy spanning (1) core agentic capabilities, (2) enabling methodologies such as retrieval-augmented generation, fine-tuning, and multimodal integration, and (3) biomedical applications and evaluation frameworks. This perspective highlights how traditional NLP systems are evolving toward autonomous, tool-using biomedical agents, providing a unified conceptual and methodological lens not covered in earlier surveys.
Specifically, we review recent progress in biomedical LLM agents published from 2023 through July 2025, emphasizing agent-centric design patterns, enabling techniques, and domain-specific adaptations. While earlier work on biomedical NLP and knowledge-driven systems laid important groundwork, large-scale agentic applications of LLMs only began to emerge after 2023 with the advent of more capable foundation models and tool-use frameworks. We therefore focus on this recent period, during which the field has rapidly evolved [13]. We analyze challenges such as hallucination, interpretability, tool reliability, data bias, and regulatory compliance, and explore mitigation strategies proposed in the current literature. Finally, we identify promising directions for future development.
To frame the scope of this review, we begin by summarizing recent agentic advances and systematically categorizing core design patterns, applications, and challenges. The structure of this review is as follows. Section 2 describes the literature collection and screening methodology. Section 3 examines core methodologies and system architectures, including single- and multi-agent designs, tool integration, fine-tuning, and multimodal reasoning. Section 4 discusses evaluation and benchmarking strategies. Section 5 reviews key challenges and mitigation techniques. Section 6 outlines future research directions. Section 7 concludes this review.
Accordingly, we pose the following research questions:
(1)
What are the main LLM agent technical paradigms in the biomedical field, and how do they satisfy requirements for subject-matter expertise, interpretability, and regulatory compliance?
(2)
In clinical and research settings, how can we comprehensively assess LLM agent performance, reliability, and safety using both automated metrics and user studies?
(3)
What are the key challenges currently faced by biomedical LLM agents in terms of knowledge updating, reasoning interpretability, resource constraints, data privacy, and ethical compliance?

2. Data Preparation

Having outlined the motivation and scope of this survey, we now describe the literature collection methodology that grounds our analysis. This review is grounded in a structured and comprehensive literature search spanning multiple authoritative sources, including PubMed, arXiv, bioRxiv, medRxiv, and supplementary searches via Google Scholar. The initial search strategy involved various keyword combinations such as “large language model agents”, “agentic AI”, and “autonomous research”, intersected with biomedical-specific terms including “drug discovery”, “clinical diagnostics”, “genomics”, “personalized medicine”, and “biomedicine”. The search covered publications from 2023 to June 2025, aligning with the timeline of major developments in biomedical LLM agent research.
It is worth noting that prior to 2023, seminal work in biomedical NLP and knowledge-driven AI laid the groundwork for current progress. Early domain-specific models such as BioBERT [14] and PubMedBERT [15] were pre-trained on biomedical corpora to enhance text understanding tasks, such as entity recognition, relation extraction, and question answering, and demonstrated the value of domain adaptation and structured knowledge integration. Foundational efforts in knowledge-driven discovery, such as Swanson’s [16] hypothesis generation framework in the late 1980s and its later extension with biomedical knowledge graphs in the 2010s [17], further illustrated how structured reasoning could aid hypothesis formation in biomedicine. However, these systems and models were largely static and lacked the general reasoning, interactivity, and tool-use capabilities that characterize agentic behavior. In contrast, foundation models developed after 2023, including GPT-4, Claude 3.5, and Gemini 1.5 Pro, extend beyond text understanding to support multimodal reasoning, long-context comprehension, and function calling. These advances, combined with instruction-following and reflective mechanisms, have transformed modern LLMs into flexible cognitive backbones capable of autonomous planning and collaboration. With the emergence of such foundation models beginning with GPT-3 in 2020 and accelerated by ChatGPT in late 2022 [18], agent-centric biomedical applications have rapidly expanded, motivating our review focus on the 2023–2025 period.
A multistage screening procedure was implemented. In the first stage, non-scholarly sources such as news articles, blog posts, and meta-reviews were excluded. In the second stage, articles were retained based on direct relevance to biomedical LLM agents exhibiting core agentic traits, such as planning, memory, interaction, and tool utilization, and demonstrating practical application in clinical or biomedical research contexts. Priority was given to original research and technically detailed works, particularly those offering empirical evaluation or system-level contributions.
Rather than aiming for exhaustive coverage, this review focuses on a representative and high-quality subset of publications that collectively reflect the state of the art and emerging trends in the development and deployment of biomedical LLM agents. The overall search and screening process is outlined in Figure 1.

3. Methodology and Architecture of Biomedical LLM Agents

With the foundational literature identified, we next examine the technical foundations that enable agentic behavior in biomedical domains.

3.1. Fundamental Concepts: From LLMs to Agents

LLMs serve as powerful natural language processing tools and form the foundation of agents [2]. However, achieving true agentic behavior requires more than the LLM alone. In the agent paradigm, the LLM acts as the core “brain,” while additional capability modules augment its functionality. First, perception enables the agent to interpret environmental inputs, ranging from textual and visual data to sensor readings, thus grounding its subsequent reasoning [13]. Planning mechanisms allow the agent to devise multi-step strategies, dynamically selecting actions that align with its objectives and adapting to the current state of its surroundings [4,19]. Memory modules further enhance performance by maintaining both short-term context for ongoing tasks and long-term repositories of accumulated knowledge and experience [13]. Finally, action or tool-use capabilities empower the agent to carry out its plans, whether by invoking external services (such as APIs, databases, or code interpreters) or by executing physical operations in embodied scenarios [4]. Many such agents are built upon broader foundation models, which acquire cross-task generalization capabilities through self-supervised learning on massive datasets [20]. Through the integration of these components, LLM agents transcend passive response generation, proactively engaging with their environment to achieve complex, goal-driven outcomes.
To systematically organize the architectural space of biomedical agents, we present a unified conceptual landscape in Figure 2, which outlines the multi-level structure spanning core agentic capabilities, system architectures, enabling techniques, and application scenarios. The agent core layer defines fundamental properties such as goal-driven behavior, interactivity, adaptability, and single- versus multi-agent paradigms; the key methodologies layer encompasses enabling techniques including multimodal integration across text, images, and genomic data, domain-specific fine-tuning, retrieval-augmented generation, and tool use; and the application layer illustrates downstream biomedical use cases ranging from clinical decision support and drug discovery to team collaboration and hybrid scenarios. Cross-cutting challenges such as data quality, catastrophic forgetting, and hallucination, together with performance-enhancement strategies, are also highlighted. Collectively, the figure provides a system-level view that links agentic design principles, methodological enablers, and biomedical applications. Table 1 lists representative biomedical LLM agents and their core components.

3.2. LLM Agent Architecture

According to the number of agents and the mode of collaboration, biomedical LLM agent systems can be roughly divided into single-agent systems and multi-agent systems (MAS).

3.2.1. Single-Agent Systems

Single-agent biomedical LLM agents aim to enhance the intelligence, autonomy, and contextual reasoning capabilities of a single foundational model through several tightly integrated mechanisms.
First, advanced prompting strategies are employed to guide the LLM in decomposing complex tasks into structured subgoals. Chain-of-Thought (CoT) [2] guides the model to reason step by step through explicit intermediate logic; ReAct (Reasoning and Acting) [13] combines reasoning with external actions such as tool use or API calls to improve factual grounding; and Tree-of-Thought (ToT) [34] expands this idea by exploring multiple reasoning branches hierarchically before selecting the optimal path. These methods enable the agent to formulate multi-step plans and execute them in a coherent and interpretable manner, facilitating logical consistency and outcome alignment in biomedical scenarios [4].
To mitigate the limitations of the context window inherent in most LLMs, memory integration has become a core component. Short-term memory modules are used to maintain active dialogue or task states, while long-term memory allows for the retrieval of accumulated domain knowledge [13]. In many systems, retrieval-augmented generation (RAG) frameworks are incorporated to dynamically access external sources, such as biomedical literature or patient records, further enriching the model’s memory with real-time, factual information [35].
Another key functionality is reflection and self-correction. Biomedical LLM agents are increasingly equipped with mechanisms to assess their own outputs and revise suboptimal responses based on internal or external feedback [35]. This process may involve internal self-evaluation or external validation via tools or other agents. For instance, BioDiscoveryAgent employs a critic agent, an auxiliary LLM, to review and challenge its reasoning and conclusions [26]. Similarly, CRISPR-GPT uses state machines to structure agent behavior and supports multiple rounds of interaction to refine outputs in gene editing design tasks [25]. Together, these strategies enable single-agent systems to exhibit more robust, interpretable, and error-tolerant behavior in biomedical applications.
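To make this single-agent pattern concrete, the sketch below shows a minimal ReAct-style loop with tool invocation and a final reflection pass. The complete() function, the tool registry, and the “tool: input” / “FINAL(answer)” step format are hypothetical conventions for illustration, not any specific framework’s API.

```python
# Minimal ReAct-style single-agent loop with tool use and a reflection pass.
# `complete()` stands in for any chat/completion API call (hypothetical).
from typing import Callable, Dict

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion call here")

TOOLS: Dict[str, Callable[[str], str]] = {
    "pubmed_search": lambda query: f"(top abstracts for: {query})",  # stub tool
}

def react_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = complete(transcript + "\nNext step ('tool: input' or 'FINAL(answer)'):")
        if step.startswith("FINAL("):
            draft = step[len("FINAL("):].rstrip(")")
            # Reflection: the model critiques and revises its own draft answer.
            return complete(f"Critique and revise for factual accuracy:\n{draft}")
        tool_name, _, arg = step.partition(":")
        tool = TOOLS.get(tool_name.strip(), lambda a: "error: unknown tool")
        transcript += f"\n{step}\nObservation: {tool(arg.strip())}"
    return "Step budget exhausted; defer to human review."
```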

3.2.2. Multi-Agent Systems (MAS)

While single-agent systems emphasize internal consistency and autonomy, more complex biomedical tasks often require collaborative intelligence. MAS leverages the collective capabilities of multiple LLM agents to collaboratively solve complex biomedical tasks that are beyond the scope of a single agent. This paradigm is inspired by real-world team-based decision-making processes commonly seen in clinical practice and biomedical research.
Collaboration among agents in MAS can be orchestrated through different architectural paradigms. In centralized configurations, a central coordinator or trainer agent is responsible for decomposing tasks, assigning subtasks, and managing the communication flow [36]. For instance, DDO divides the diagnostic workflow into two specialized agents, one for symptom inquiry and another for disease diagnosis, which collaborate through structured dialogue to emulate physician–patient interactions [37]. SM-MAS further demonstrates a scalable and modular MAS framework tailored for adaptive clinical decision-making [38]; it enables dynamically composed agent teams (e.g., symptom extractors, reasoning agents, verifiers) to interact under a shared protocol, leading to improved accuracy and transparency compared to single-agent baselines.
Decentralized MAS architectures, by contrast, promote autonomous interaction among agents without centralized control. These can be further categorized into revision-based optimization (where agents iteratively refine each other’s outputs) and protocol-based communication (where information is shared according to pre-defined rules) [39]. Planning strategies in MAS typically fall into two main categories, each offering different trade-offs between control and autonomy [41]: centralized planning with decentralized execution (CPDE), in which a central controller plans the overall workflow while local agents execute tasks independently, and decentralized planning with decentralized execution (DPDE), in which each agent plans and acts autonomously, improving scalability at the cost of coordination complexity. In practice, hybrid systems that blend centralized planning with decentralized execution have demonstrated superior flexibility and scalability, particularly in complex, multi-step biomedical scenarios [40]. Communication between agents is facilitated either explicitly, via structured message passing, or implicitly, through shared memory or state representations [41]. Efficient communication protocols, mediator agents, and verification mechanisms are often incorporated to enhance cooperation quality and minimize the propagation of hallucinations or inconsistencies.
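As a concrete illustration of the CPDE pattern, the following sketch has a coordinator decompose a case into subtasks that role-specialized agents execute independently before a final synthesis step. The ask() helper and the role names are hypothetical placeholders.

```python
def ask(role: str, prompt: str) -> str:
    """Hypothetical role-conditioned LLM call (system prompt set per role)."""
    raise NotImplementedError

SPECIALISTS = ["symptom_extractor", "diagnostic_reasoner", "verifier"]

def cpde_diagnose(case_note: str) -> str:
    # Centralized planning: the coordinator assigns one subtask per specialist.
    plan = ask("coordinator",
               f"Decompose into subtasks for {SPECIALISTS}:\n{case_note}")
    # Decentralized execution: each specialist handles its subtask independently.
    reports = {role: ask(role, f"{plan}\n\nYour subtask output for:\n{case_note}")
               for role in SPECIALISTS}
    # The verifier's report helps arbitrate disagreements before synthesis.
    return ask("coordinator",
               f"Synthesize a final assessment from the reports:\n{reports}")
```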
Overall, multi-agent biomedical systems introduce a powerful abstraction for simulating collaborative decision-making, distributing cognitive load, and improving robustness through agent specialization. These frameworks are particularly valuable for tasks requiring multiple expert perspectives, such as rare disease diagnosis, clinical trial matching, and cross-modal reasoning.

3.3. Methodology of Biomedical Agents

Beyond agent architecture, enabling techniques such as retrieval-augmented generation, fine-tuning, tool use, and multimodal processing further define agent functionality. To enable LLM agents to work effectively and reliably in the biomedical field, researchers have adopted various key technologies to enhance their capabilities and overcome inherent limitations.

3.3.1. Retrieval-Augmented Generation

RAG is a foundational technique that enhances the factual grounding and domain specificity of LLM agents by integrating them with external knowledge sources [42]. This approach addresses two core limitations of general-purpose LLMs: outdated knowledge due to static pretraining and insufficient coverage of biomedical domain-specific concepts. By retrieving relevant and up-to-date information, such as PubMed abstracts, clinical guidelines, electronic health records (EHRs), medical knowledge graphs, or real-time web content, RAG empowers agents to produce contextually accurate and verifiable responses [35].
The typical RAG pipeline involves several stages: query rewriting, document preprocessing (e.g., chunking), indexing (e.g., embedding-based retrieval or maximal marginal relevance ranking), retrieval, and response generation. Prompt engineering and structured integration of retrieved content are also critical to ensure coherence and factual alignment during response synthesis [42].
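A compressed sketch of these stages appears below, assuming hypothetical complete() and embed() helpers; a production system would replace the brute-force scoring with a vector index and re-ranking such as maximal marginal relevance.

```python
import numpy as np

def complete(prompt: str) -> str: ...   # any LLM completion call (stub)
def embed(text: str) -> np.ndarray: ... # unit-norm sentence embedding (stub)

def chunk(doc: str, size: int = 500) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def rag_answer(question: str, corpus: list[str], k: int = 3) -> str:
    # 1. Query rewriting.
    query = complete(f"Rewrite as a standalone biomedical search query: {question}")
    # 2-3. Preprocessing (chunking) and embedding-based indexing.
    chunks = [c for doc in corpus for c in chunk(doc)]
    index = np.stack([embed(c) for c in chunks])   # built offline in practice
    # 4. Retrieval: top-k chunks by similarity to the rewritten query.
    top = np.argsort(index @ embed(query))[-k:]
    evidence = "\n---\n".join(chunks[i] for i in top)
    # 5. Grounded generation with explicit evidence in the prompt.
    return complete(f"Answer using ONLY this evidence; cite chunk numbers.\n"
                    f"{evidence}\n\nQuestion: {question}")
```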
Biomedical agents often rely on RAG not only for evidence retrieval but also for tool integration. For instance, SourceCheckup employs RAG to trace whether LLM-generated statements are supported by citations, enabling citation-level fact-checking in medical QA systems [5]. Multi-agent systems further demonstrate the power of division of labor in retrieval workflows. In Fan et al.’s multi-agent normalization framework [10], specialized agents collaboratively perform retrieval, evidence expansion, and decision-making for biomedical entity linking tasks. Overall, RAG provides a scalable mechanism for reducing hallucination, enhancing trustworthiness, and ensuring citation traceability in biomedical LLM agents.

3.3.2. Fine-Tuning and Domain Adaptation

While retrieval expands an agent’s knowledge access, fine-tuning customizes its core behavior for domain-specific reasoning. Although general-purpose LLMs demonstrate impressive linguistic capabilities, their performance in biomedical contexts is often hindered by limited domain-specific knowledge and contextual understanding [11]. Fine-tuning addresses this gap by adapting pre-trained models to biomedical tasks through additional supervised or instruction-based training on curated corpora such as PubMed abstracts, clinical notes, or medical textbooks [2]. This process enhances the model’s grasp of clinical terminology, biomedical reasoning patterns, and task-specific discourse.
Two main fine-tuning strategies are commonly employed. Full-parameter fine-tuning adjusts all weights of the model, typically requiring significant computational resources and large labeled datasets. In contrast, parameter-efficient fine-tuning (PEFT) methods such as adapters, LoRA, or prefix tuning update only a small subset of parameters, offering cost-effective alternatives for institutions with limited resources [43]. For example, Taiyi is a bilingual biomedical LLM trained using a two-stage fine-tuning approach that distinguishes generative from discriminative tasks across over 140 Chinese and English datasets [30].
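As a concrete example, a parameter-efficient setup with LoRA via the Hugging Face peft library might look like the sketch below; the base checkpoint and the target module names are illustrative and depend on the architecture being adapted.

```python
# Minimal LoRA (parameter-efficient fine-tuning) setup with Hugging Face peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
config = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
# `model` can now be trained on curated biomedical text with a standard Trainer.
```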
Fine-tuning is often combined with retrieval techniques to improve robustness. MedBioLM integrates supervised fine-tuning and RAG to support question answering across diverse biomedical topics [31]. However, fine-tuning also presents several challenges. The most prominent include the scarcity of high-quality annotated biomedical data and the risk of catastrophic forgetting, whereby the model loses previously acquired general knowledge during task-specific training [44]. Addressing these issues requires continual learning strategies, domain-aware data curation, and hybrid training pipelines that preserve generalization capabilities while specializing in biomedical semantics.

3.3.3. Tool Use

Complementary to static knowledge adaptation, tool use empowers agents to interact with external systems and perform grounded actions. Despite their impressive linguistic and reasoning abilities, LLMs exhibit significant limitations in performing precise calculations, accessing up-to-date information, or interacting with external systems [6]. To address these gaps, tool use and function-calling capabilities have been integrated into biomedical LLM agents, enabling them to invoke APIs, access structured databases, execute code, or interface with domain-specific software modules [4]. These interactions allow agents to go beyond text-only responses and operate as interactive systems capable of performing verifiable, real-world tasks.
One prominent application is in clinical computation, where models are tasked with executing medical scoring systems or risk prediction algorithms. While LLMs can reason about medical concepts, their arithmetic reliability is often poor. Tools such as OpenMedCalc, code interpreters, and mathematical plugins have been shown to drastically reduce error rates [45]. For instance, Goodell et al. demonstrated that integrating external calculation tools with LLMs reduced the error rate by a factor of 5.5 for LLaMA and 13 for GPT models [4]. The CalcQA framework further improves execution fidelity through flexible tool orchestration and unit-aware conversions [6]. Building on this, ReflecTool adds a self-reflection module that guides tool use based on past outcomes, improving reliability across repeated clinical tasks [46]. MedOrch further enhances orchestration across iterative planning rounds and pipeline components [47], while MMedAgent introduces modular tool invocation tailored for specialized biomedical tasks [48].
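To make the pattern concrete, the sketch below wraps a standard clinical formula, the Cockcroft–Gault creatinine clearance, as a callable tool behind a simple schema: the agent emits structured arguments and the runtime performs the arithmetic deterministically. The schema format is generic and illustrative, not any particular vendor’s function-calling API.

```python
# A clinical calculator exposed as a tool: the LLM selects the tool and fills
# in arguments; the runtime executes it, avoiding LLM arithmetic errors.
def cockcroft_gault(age: int, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Creatinine clearance (mL/min) via the Cockcroft-Gault equation."""
    crcl = ((140 - age) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return round(crcl * 0.85 if female else crcl, 1)

TOOL_SCHEMA = {  # generic, illustrative schema the agent sees
    "name": "cockcroft_gault",
    "description": "Estimate creatinine clearance for renal drug dosing.",
    "parameters": {"age": "int", "weight_kg": "float",
                   "serum_creatinine_mg_dl": "float", "female": "bool"},
}

# Arguments as the model would return them in a structured function call.
call_args = {"age": 67, "weight_kg": 80.0,
             "serum_creatinine_mg_dl": 1.4, "female": False}
print(cockcroft_gault(**call_args))  # -> 57.9
```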
Beyond numerical reasoning, biomedical agents employ tool calling in tasks such as literature and gene retrieval via PubMed or NCBI APIs [4,26], chemical structure analysis via cheminformatics packages [20], and automated experimentation through lab interface protocols, or interaction with structured EHR databases [49]. In pharmacovigilance, for example, agents use specialized toolchains to detect adverse drug events by combining retrieval, extraction, and interpretation modules [50]. Frameworks like LangChain and AutoGen facilitate modular integration of these tools, allowing agents to compose workflows dynamically and respond adaptively to task-specific requirements [35].
Tool-augmented agents bridge the gap between LLM reasoning and biomedical task execution, allowing for grounded, auditable, and real-time interactions with the physical and digital healthcare environment.

3.3.4. Multimodal Integration

In real-world settings, biomedical reasoning must span text, imaging, EHR, and genomic data, necessitating multimodal integration. Biomedical data is inherently multimodal, encompassing clinical narratives, imaging (e.g., X-rays, CT, MRI), structured data from electronic health records (EHR), genomic sequences, and physiological signals [51]. To effectively interpret and reason across these diverse modalities, LLM-based agents increasingly integrate multimodal learning techniques and architectures.
Two primary architectural paradigms have emerged. One is the CLIP-style model, which uses contrastive learning to align images and text into a shared representation space, enabling efficient retrieval and classification across modalities [51]. The other is LLM-centric, where non-text modalities are encoded by specialized encoders (e.g., visual or molecular encoders) and projected into the language model’s embedding space, allowing agents to perform unified reasoning over multimodal inputs. Representative systems including LLaVA-Med [52] and GeneVerse [53] enable agents to interpret imaging and genetic data through joint embedding and instruction-following capabilities.
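A minimal sketch of the LLM-centric paradigm is shown below: patch features from a frozen vision encoder are projected into the language model’s embedding space and prepended to the text tokens, in the spirit of LLaVA-style designs. All dimensions and module names here are illustrative.

```python
# Projecting frozen vision-encoder features into an LLM's embedding space.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP maps image patch features into "soft tokens" that the
        # language model can attend over alongside ordinary text tokens.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

projector = VisionToLLMProjector()
image_feats = torch.randn(1, 256, 1024)  # e.g., frozen ViT patch features
text_embeds = torch.randn(1, 32, 4096)   # embedded prompt tokens
inputs = torch.cat([projector(image_feats), text_embeds], dim=1)
# `inputs` now feeds the LLM as one multimodal embedding sequence: (1, 288, 4096).
```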
MedChat further demonstrates this integration by coordinating vision agents and text-based agents under a central controller to support cross-modal diagnostic reasoning [54]. Building upon such simulated frameworks, several multimodal biomedical agents have also been validated using real-world clinical datasets, highlighting their practical feasibility. For instance, Ferber et al. [55] presented a deployed oncology decision-support agent that combined GPT-4 with MedSAM for image segmentation, a transformer for pathology analysis, and OncoKB for knowledge grounding, achieving 91% diagnostic accuracy on real patient cases. Similarly, AgentClinic [8] integrates multimodal EHR, imaging, and dialogue data from the MIMIC-IV dataset to evaluate agents in realistic clinical simulations, while TxAgent [56] was assessed on clinical guideline corpora and therapy recommendation tasks using actual drug–disease treatment records. Collectively, these evaluations demonstrate that multimodal biomedical agents are progressing from conceptual designs toward clinically validated, deployable systems.
Beyond clinical diagnostics, multimodal integration is crucial in biomedical research. For instance, cheminformatics tools are used to process molecular structures for drug screening [20], and multimodal agents are applied in pharmacovigilance to detect adverse drug events (ADEs) through the interaction of retrieval, extraction, and interpretation modules [50]. In drug discovery and genomics, agents like those developed by Liu et al. [57,58] and Xu et al. [59] process graph-structured molecular data, literature, and omics features simultaneously.
Multimodal capabilities are also essential in task pipelines involving tool chaining and hierarchical planning. For example, TxAgent [56] enables multi-step treatment recommendation by integrating medical record analysis, clinical guideline retrieval, drug database querying, and executable scoring tools. Related systems such as ClinicalAgent [24] and ColaCare [60] extend this pipeline with personalized reasoning and cross-modality verification.
Finally, robust multimodal integration contributes significantly to agent trustworthiness and robustness. Recent studies like Yi et al. [61] and Almansoori et al. [62] highlight the importance of modality alignment, hierarchical fusion, and adaptive control in building safe and interpretable biomedical agents.

4. Performance Evaluation and Benchmarking

Having detailed core agent methodologies, we now turn to evaluating how biomedical LLM agents perform across realistic tasks. As these agents continue to evolve, it is increasingly essential to develop rigorous and comprehensive evaluation methodologies that reflect their capabilities, reliability, and safety in real-world scenarios.

4.1. The Need for Agent-Specific Evaluation

Traditional NLP benchmarks and static QA datasets such as MedQA (United States Medical Licensing Examination) are no longer adequate for evaluating LLM-based agents [63]. These agents operate beyond passive language modeling; their core value lies in dynamic decision-making processes, including sequential reasoning, adaptive planning in evolving environments [64], external tool usage for calculations or database access [8], multi-step inference in complex contexts [64], and collaborative interactions with users or other agents [8]. As a result, evaluation has shifted from assessing static knowledge recall to measuring task completion success, reasoning quality, and agent behavior under interactive conditions.

4.2. Key Benchmarks for Biomedical LLM Agents

To meet these evaluation needs, several benchmarks have been proposed that simulate real-world biomedical applications across diverse dimensions of agent capability. These can be broadly grouped into benchmarks for interactive clinical simulation, tool-augmented reasoning, and trustworthiness or domain-specific assessment.
Benchmarks for interactive clinical simulation include AgentClinic [8], which evaluates agent performance in multimodal, role-driven clinical scenarios where agents act as doctors, patients, or moderators. It incorporates datasets such as AgentClinic-MedQA [8], NEJM, and MIMIC-IV [65], drawn from USMLE questions, clinical image challenges, and real-world EHRs. MedJourney [7] is a Chinese-language benchmark designed to assess LLM performance throughout the entire patient journey, from diagnosis to follow-up, across 12 subtasks, including department recommendation, dialogue summarization, and medication Q&A. It combines automated metrics (e.g., BLEU, recall) with human or LLM-based evaluations.
For tool-augmented reasoning and execution, CalcQA [6] evaluates agents’ ability to interpret clinical contexts, select appropriate calculators, convert units, and deliver final assessments. It comprises 44 calculators, 237 unit conversions, and 100 matched real-world cases. MedAgentBench [63] benchmarks agent interaction with FHIR-compliant EHR systems through 300 structured tasks across 10 clinical categories. The generalist EHR agent developed by Song et al. [66] similarly assesses clinical data manipulation across a 100-patient dataset.
In the realm of trustworthiness and specialized domains, MedHal [67] targets hallucination detection in clinical and QA contexts, while SourceCheckup [5] evaluates the factual integrity of generated content by measuring citation validity and source alignment. BixBench [68] focuses on bioinformatics applications, such as gene function prediction and molecular structure analysis. ArgMed-Agents [28] tests explainable decision-making via structured argumentation frameworks for clinical reasoning.
These benchmarks collectively assess biomedical LLM agents along different axes, including reasoning accuracy, tool use fidelity, and epistemic reliability. However, most are limited to static or task-specific evaluations and insufficiently capture dynamic agentic capabilities such as planning, memory, and tool orchestration. The development of holistic, standardized benchmark suites remains a critical priority. Table 2 provides an overview of major biomedical agent evaluation benchmarks.

4.3. Evaluation Indicators and Methods

Evaluating biomedical LLM agents requires a multifaceted set of metrics and methodologies to capture their complex behaviors across different tasks and settings. Task success rate [63] assesses whether the agent ultimately achieves its intended goal, such as issuing a correct diagnosis or generating an actionable experimental plan. Accuracy [4] is commonly applied in classification and QA tasks to measure the proportion of correct outputs. Natural language generation (NLG) metrics such as BLEU, ROUGE, and BERTScore are used to quantify the similarity between generated and reference texts, particularly for summarization or open-ended generation tasks [7].
Qualitative metrics [2], often based on human expert judgment (e.g., clinicians), employ Likert scales to evaluate dimensions such as clarity, completeness, usefulness, relevance, coherence, safety, and bias. Although subjective, such assessments provide clinical insight that automated metrics frequently overlook, but they also face challenges like inter-rater variability and high annotation costs. The LLM-as-judge approach [7] offers a scalable alternative by leveraging powerful models such as GPT-4 to evaluate other agents’ outputs, though questions around its objectivity and reliability remain.
In addition, agent-specific metrics [11] are employed to evaluate planning quality, tool invocation accuracy, interaction efficiency (e.g., number of turns to task completion), and robustness to adversarial or biased inputs. Cost and efficiency indicators [39] track operational resource usage, including API call frequency, latency, and computational overhead, which are critical for assessing practical feasibility.
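As a toy illustration, several of these agent-specific metrics can be computed directly from interaction logs; the log schema below is hypothetical.

```python
# Computing agent-specific metrics from (hypothetical) interaction logs.
logs = [
    {"success": True,  "turns": 4, "tool_calls": 3, "tool_calls_valid": 3},
    {"success": False, "turns": 9, "tool_calls": 5, "tool_calls_valid": 3},
    {"success": True,  "turns": 6, "tool_calls": 2, "tool_calls_valid": 2},
]

task_success_rate = sum(e["success"] for e in logs) / len(logs)
tool_accuracy = (sum(e["tool_calls_valid"] for e in logs)
                 / sum(e["tool_calls"] for e in logs))
avg_turns = sum(e["turns"] for e in logs) / len(logs)

print(f"task success rate:        {task_success_rate:.2f}")  # 0.67
print(f"tool invocation accuracy: {tool_accuracy:.2f}")      # 0.80
print(f"interaction efficiency:   {avg_turns:.1f} turns")    # 6.3
```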

4.4. Performance Comparison

Recent evaluation studies offer several important insights into the performance dynamics of biomedical LLM agents. First, agent-based evaluation proves substantially more difficult than static QA. Models that perform well on datasets like MedQA often experience dramatic drops in accuracy, sometimes to less than one-tenth of their original scores, when the same questions are embedded in interactive diagnostic scenarios [63,69]. This highlights the inadequacy of static benchmarks in capturing agent capabilities within simulated clinical workflows.
Second, significant performance variation exists across models. Proprietary LLMs such as GPT-4 and Claude 3.5 consistently outperform open-source counterparts in complex agent tasks [63], particularly those requiring multi-step reasoning, tool usage, or sustained interaction. This underscores the current limitations of open models in agentic reasoning environments.
Third, performance is notably enhanced through task-specific augmentation strategies. RAG has been shown to significantly improve accuracy on clinical tasks [42], while specialized tool use, such as OpenMedCalc, improves computational precision and reduces error rates. Moreover, multi-agent frameworks, including MedAgents [21], the MAC framework [22], and GPT-4-based voting ensembles for medical diagnosis [70,71], consistently outperform single-agent setups in complex diagnostic reasoning. These findings emphasize the critical role of targeted enhancement and collaborative mechanisms in optimizing agent performance and clinical reliability.
The development of these specialized, interactive benchmarks is an important trend in evaluating biomedical LLM agents [72]. They shift the focus of evaluation from static knowledge recall to the actual utility and safety of agents in simulating real workflows. However, standardization of evaluation methods and standards remains a challenge that requires continued research and community consensus.

5. Key Challenges and Mitigation Strategies

Despite their promising potential, biomedical LLM agents face a spectrum of unresolved technical and ethical challenges.
As illustrated in Figure 3, these challenges span multiple layers, from factual hallucinations and interpretability to tool integration, data bias, and regulatory governance. Each category is further mapped to mitigation strategies proposed in the recent literature, highlighting the need for principled, risk-aware design.

5.1. Hallucinations and Factual Inaccuracies

(1)
Challenge: LLM agents are prone to generating content that is plausible but factually incorrect or unsupported, a phenomenon known as hallucination. In high-stakes biomedical applications, such errors can lead to harmful outcomes, including misdiagnoses, inappropriate treatment recommendations, and misleading scientific claims, thereby jeopardizing patient safety and research integrity. Manifestations include erroneous clinical calculations [4], citing irrelevant or unsupported references [5], fabricating nonexistent facts, or producing outputs that contradict input information [73].
(2)
Mitigation Strategies: RAG [42] mitigates hallucination by grounding model outputs in verified knowledge sources such as PubMed or clinical guidelines, though its reliability still requires careful validation [74]. Delegating computation to external tools also helps reduce errors in quantitative reasoning tasks [4]. Self-correction mechanisms and reflection loops allow agents to evaluate and revise their outputs [13], while systems like SourceCheckup [5] support automated validation and transparent citation. Additionally, hallucination detection tools and purpose-built benchmarks (e.g., MedHal [67]) assist in identifying factual inconsistencies [75]. Prompt engineering [11] also plays a role in reducing hallucinations by guiding the model toward evidence-based responses.
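A simplified form of such citation-level verification is sketched below: each generated sentence is flagged when its best embedding similarity against the retrieved sources falls under a threshold. The embed() helper and the 0.75 cutoff are illustrative assumptions, not SourceCheckup’s actual procedure.

```python
# Flag generated sentences with no strong support among retrieved sources.
import numpy as np

def embed(text: str) -> np.ndarray: ...  # unit-norm sentence embedding (stub)

def ungrounded_sentences(answer: str, sources: list[str],
                         threshold: float = 0.75) -> list[str]:
    src_vecs = np.stack([embed(s) for s in sources])
    flagged = []
    for sent in answer.split(". "):              # crude sentence segmentation
        support = float(np.max(src_vecs @ embed(sent)))  # best cosine match
        if support < threshold:
            flagged.append(sent)  # candidate hallucination: weak source support
    return flagged
# Flagged sentences can be revised, removed, or routed to human review.
```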

5.2. Explainability and Transparency

(1)
Challenge: The opaque nature of deep learning models makes it difficult to interpret the internal decision-making processes of LLMs [13]. In biomedical settings, lack of interpretability can hinder clinical trust, adoption, and accountability [13].
(2)
Mitigation Strategies: Various interpretability techniques, such as SHAP, LIME, and attention visualization, offer insights into model predictions [2]. Reasoning transparency can be improved through structured prompting strategies like Chain-of-Thought (CoT), Chain-of-Diagnosis (CoD) [76], or argumentation frameworks used in ArgMed-Agents [28]. Integrating structured knowledge graphs grounds reasoning in transparent semantic relations [2], while model cards provide standardized documentation on training data, performance, and limitations [77]. Human-in-the-loop designs also enhance transparency by allowing users to query and critique model outputs interactively [78].

5.3. Data Quality, Availability, and Bias

(1)
Challenge: LLM agent performance is highly sensitive to data quality and diversity [51]. Biomedical datasets are often limited in size, poorly annotated, or hypothetical in nature [51]. More critically, training data may encode biases related to race, gender, or socioeconomic status, which can lead to inequitable care recommendations [12,79]. For instance, models may misdiagnose skin conditions in patients with darker skin tones or suggest higher-cost treatments based on demographic profiles. These biases may be further amplified by feedback loops and dataset reuse [79].
(2)
Mitigation Strategies: High-quality, diverse, and rigorously curated datasets are foundational to reducing bias [51]. Bias mitigation should be implemented across the entire pipeline, from data collection to training and evaluation, using techniques such as adversarial debiasing, data augmentation, and fairness-aware optimization [13]. Ongoing auditing of model outputs across demographic groups is critical [77], as is stakeholder involvement during dataset development to ensure inclusive representation and equity [77].
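A minimal version of such an output audit is sketched below: accuracy is computed per demographic group and the largest pairwise gap serves as a crude disparity score; the data fields are hypothetical.

```python
# Per-group accuracy audit with a simple disparity score.
from collections import defaultdict

predictions = [  # illustrative audit records
    {"group": "A", "correct": True}, {"group": "A", "correct": True},
    {"group": "B", "correct": True}, {"group": "B", "correct": False},
]

hits, totals = defaultdict(int), defaultdict(int)
for p in predictions:
    totals[p["group"]] += 1
    hits[p["group"]] += p["correct"]

per_group = {g: hits[g] / totals[g] for g in totals}
disparity = max(per_group.values()) - min(per_group.values())
print(per_group, f"accuracy gap = {disparity:.2f}")  # {'A': 1.0, 'B': 0.5} gap 0.50
```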

5.4. Tool Reliability and Integration

(1)
Challenge: LLM agents rely heavily on external tools for computation, retrieval, and decision support [20]. Failures in tool reliability, outdated resources, or unstable interfaces can compromise agent performance. Additionally, agents often struggle with understanding tool functions, parameter usage, and invocation contexts [34], leading to misuse or overdependence [34].
(2)
Mitigation Strategies: Ensuring tool robustness requires continuous development, validation, and API standardization [20]. Enhancing agent comprehension of tool documentation through fine-tuning or prompt optimization improves tool selection and usage accuracy [6]. Benchmarks such as Medcalc-Bench [45] provide structured assessments of agents’ ability to execute clinical computations via APIs, including drug dosing, renal function estimation, and risk scoring. Finally, incorporating error-handling mechanisms enables agents to respond gracefully to tool failures, improving overall reliability.
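One simple error-handling pattern, bounded retries with exponential backoff followed by a structured failure signal the agent can reason about, is sketched below under illustrative assumptions.

```python
# Graceful tool-failure handling: retry with backoff, then structured failure.
import time

def call_tool_safely(tool, *args, retries: int = 3, backoff: float = 0.5):
    last_error = "tool was never invoked"
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as err:                # timeout, bad params, schema drift
            last_error = str(err)
            time.sleep(backoff * 2 ** attempt)  # wait 0.5 s, 1 s, 2 s, ...
    # Surface a structured failure so the agent can re-plan, pick another tool,
    # or escalate to a human instead of silently fabricating a value.
    return {"ok": False, "error": last_error}
```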

5.5. Multimodal Data Integration and Processing

(1)
Challenge: Biomedical agents must often integrate heterogeneous data modalities, including text, images, genomics, and EHRs, which differ in format, scale, and sparsity [51,80]. Aligning and jointly analyzing these modalities in a unified framework remains technically challenging, especially under incomplete or noisy conditions [51].
(2)
Mitigation Strategies: Progress requires more advanced fusion architectures (e.g., gated and tensor fusion, Transformer-based joint encoders) capable of modeling cross-modal relationships [81]. Robust alignment techniques are essential to link data at the semantic level [51]. Agents must also handle missing data gracefully, which calls for models explicitly designed for partial or sparse input [81]. Building comprehensive multimodal datasets and evaluation benchmarks is equally crucial for meaningful progress [51].
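As one example of such an architecture, the sketch below implements a gated fusion layer over two modality embeddings with a presence mask for missing images; all dimensions are illustrative.

```python
# Gated fusion of text and image embeddings with a missing-modality mask.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # A learned gate decides, per dimension, how much each modality contributes.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_z, img_z, img_present: torch.Tensor):
        img_z = img_z * img_present.unsqueeze(-1)     # zero out missing images
        g = self.gate(torch.cat([text_z, img_z], dim=-1))
        return g * text_z + (1 - g) * img_z           # convex per-dim mixture

fuse = GatedFusion()
text_z, img_z = torch.randn(4, 256), torch.randn(4, 256)
mask = torch.tensor([1., 1., 0., 1.])                 # third sample lacks imaging
fused = fuse(text_z, img_z, mask)                     # shape: (4, 256)
```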

5.6. Complexity of Multi-Agent Collaboration

(1)
Challenge: Multi-agent systems (MAS) offer enhanced flexibility and specialization but introduce challenges such as task allocation, inter-agent reasoning coordination, and communication overhead [35,82,83]. In biomedical workflows, failures in coordination can lead to suboptimal or erroneous outcomes [39].
(2)
Mitigation Strategies: Optimizing collaboration requires adaptive control schemes (MDAgents [39]), hierarchical role assignment (KG4Diagnosis [23]), and efficient communication protocols [41]. Consensus mechanisms, such as supervisory agents, can arbitrate disagreements and ensure reliability. Explicit role specialization also reduces redundancy and conflict. For instance, MedCo demonstrates how agents can function as educators, students, and evaluators in a simulated medical training context [84].
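A toy version of such a consensus mechanism is sketched below: independent agent opinions are tallied by majority vote, with a supervisory arbiter invoked when no majority emerges. The aggregation scheme is deliberately simplified.

```python
# Majority-vote consensus with a supervisory fallback for ties/disagreement.
from collections import Counter

def consensus(opinions: list[str], arbiter=None) -> str:
    tally = Counter(opinions)
    answer, votes = tally.most_common(1)[0]
    if votes > len(opinions) / 2:
        return answer
    # No majority: defer to a supervisory agent (or a human) for arbitration.
    return arbiter(opinions) if arbiter else "escalate: no consensus"

print(consensus(["pneumonia", "pneumonia", "bronchitis"]))  # -> pneumonia
```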

5.7. Ethics, Privacy, Security, and Regulation

(1)
Challenge: Ethical, legal, and social concerns are among the most pressing challenges in deploying biomedical LLM agents. Risks include algorithmic bias that may exacerbate healthcare inequities [12], erosion of traditional patient–clinician relationships [85], unclear accountability in cases of AI-induced medical error [12], overreliance on AI systems [75], and disputes around intellectual property of AI-generated content [35].
Privacy and security concerns arise from handling large volumes of sensitive patient data, with threats such as data leakage, re-identification, or misuse [12]. Agent interactions with external tools or other agents may also introduce new attack surfaces and cybersecurity vulnerabilities [86]. Compliance with regulations like HIPAA and GDPR is essential [87], but current regulatory frameworks (e.g., FDA, EU AI Act) may not fully account for the dynamic, emergent behavior of LLM agents [63]. The absence of clear approval pathways, accountability standards, and post-deployment oversight further compounds regulatory uncertainty [63].
(2)
Mitigation Strategies: To address these concerns, ethical principles such as fairness, transparency, accountability, and safety must be embedded throughout the agent lifecycle [88]. Privacy-preserving techniques, including data anonymization, differential privacy, federated learning, secure multi-party computation, and synthetic data generation, help protect sensitive information [35]. Strengthening cybersecurity through adversarial testing and secure interface design is also essential [35].
In parallel, comprehensive governance frameworks and AI-specific regulatory standards are needed to clarify stakeholder responsibilities and ensure safe deployment [89]. Increasing algorithmic transparency and instituting robust accountability mechanisms are critical steps toward trustworthy and responsible AI use in biomedical contexts [12].
The matrix in Table 3 provides a principled lens to understand emerging risks, map mitigation pathways, and design appropriate assessment instruments. Trustworthy AI in biomedicine is not merely a technical objective, but a fundamental requirement for real-world adoption. Addressing these challenges requires sustained collaboration across technical, ethical, legal, and clinical domains.

6. Discussion

6.1. Major Insights from the Current Landscape

(1)
The necessity of Human-in-the-Loop: Despite rapid advancements in autonomous LLM agents, the integration of Human-in-the-Loop (HITL) mechanisms remains indispensable for biomedical applications. Given the high-stakes nature of clinical decision-making, full agent autonomy in tasks such as diagnosis, treatment planning, or data interpretation is neither ethically acceptable nor technically safe. HITL frameworks enable real-time expert supervision, enhance system transparency, and act as safeguards against hallucinations, biases, and misuse of external tools. Prior studies, including the Zodiac framework [90] and virtual lab agents [91], underscore how clinician–agent collaboration can not only mitigate risk but also improve performance through iterative feedback and correction. Recent developments have also explored internal supervision via multi-agent architectures. For example, Cui et al. [92] proposed a dual-agent model in which a critical agent dynamically monitors and adjusts the reasoning process of a predictive agent—effectively emulating internalized HITL oversight. These findings collectively support the view that biomedical agents should be explicitly designed with human supervision interfaces, particularly in regulation-sensitive domains such as oncology, radiology, and pharmacovigilance.
(2)
Full-spectrum biomedical applications, from basic research to clinical implementation: LLM agents are rapidly emerging as generalizable tools capable of driving innovation across the entire biomedical pipeline. From early-stage hypothesis generation and drug discovery to downstream tasks like clinical trial matching, diagnostic reasoning, and personalized care planning, these systems are demonstrating substantial potential to accelerate, scale, and automate workflows traditionally limited by human capacity. Notable examples such as MedAgent-Pro [93] and FUAS-Agents [94] illustrate how agents can operate within structured, rule-based clinical environments like surgical decision-making or protocol-driven diagnosis. Figure 4 summarizes representative agent workflows across biomedical domains, including healthcare optimization, dialogue simulation, scientific research, and clinical decision support. Simultaneously, open-ended agents like the “virtual laboratory” platform show promise for creativity and hypothesis exploration in basic science contexts, such as antibody design. Collectively, these developments support the growing consensus that biomedical LLM agents are not confined to isolated use cases but instead represent a versatile computational paradigm capable of supporting diverse biomedical tasks.

6.2. Proposed New Metrics for Agent Robustness and Trustworthiness

While existing benchmarks evaluate agent accuracy or task completion, they often overlook dimensions such as factual grounding, transparency, and resilience. Purpose-built metrics like hallucination rates [67], demographic bias scores [79], citation integrity [5], and reasoning traceability [28] are essential complements to traditional evaluation. Multidimensional benchmarking suites that capture error detection, self-reflection, and multi-agent coordination are urgently needed to assess trustworthiness holistically.

6.3. Toward Standardized Safety Evaluation

Mitigating biomedical agent risks is not a one-time engineering fix but an ongoing process of validation, governance, and user alignment. Standardized evaluation frameworks must address both task-specific performance and broader dimensions of safety, privacy, and fairness. Integration of emerging techniques like RAG, reflective prompting, federated learning, and secure multi-agent systems will be critical. Ultimately, trustworthy biomedical AI demands cross-disciplinary collaboration across technical, regulatory, and clinical communities.

6.4. Strategic Perspectives for Future Development

The field of biomedical LLM agents is evolving rapidly, and several strategic directions are emerging to support their safe, effective, and scalable integration into real-world applications.
(1)
Enhanced Medical Reasoning and Planning Capabilities: Despite substantial progress, current biomedical LLM agents still face limitations in handling complex, long-range reasoning and multi-step task planning [28]. Future research is expected to explore multiple complementary directions.
First, advances in foundation models such as DeepSeek R1 [95] offer improved capabilities in long-context comprehension, instruction following, and nuanced medical reasoning. Second, new reasoning paradigms are being introduced beyond conventional Chain-of-Thought (CoT) and ReAct strategies. These include symbolic logic frameworks [96], causal inference-based models [97], and hierarchical planning algorithms that can better support high-stakes clinical decision-making [28]. Third, the ability of agents to self-assess their confidence and recognize knowledge gaps is gaining attention as a mechanism to ensure safety and reliability [98]. Additionally, adaptive architectures that dynamically restructure based on task requirements may offer further gains in generalization and robustness [99].
(2)
Continual Learning and Knowledge Updating: Given the fast pace of biomedical research, agents must maintain mechanisms for continuous learning and timely knowledge integration to prevent the use of outdated or invalid information [12].
This need motivates future efforts in three key directions. First, efficient incorporation of new clinical guidelines, medical literature, and experimental findings, potentially through hybrid strategies combining RAG and fine-tuning, will be essential. Second, lifelong or continual learning frameworks [100] must be designed to allow agents to update incrementally while preserving previously acquired knowledge [101]. Third, while federated learning offers strong privacy protection, its deployment in real medical environments faces challenges such as data heterogeneity, communication latency, and privacy–utility trade-offs. Recent studies have proposed validated mitigation strategies, including adaptive optimization algorithms (e.g., FedProx, FedDyn) to address data imbalance, secure aggregation and homomorphic encryption to prevent information leakage, and federated differential privacy mechanisms for enhanced protection of gradient updates. Preliminary biomedical applications, such as the FLUID framework [102] and MedFL [103], have demonstrated that incorporating these strategies enables robust distributed learning and continual adaptation across hospitals without exposing sensitive patient data. These approaches represent a significant step toward trustworthy and collaborative AI in healthcare; a minimal federated averaging sketch is given at the end of this subsection.
(3)
Standardization, Governance, and Regulation: To enable trustworthy and equitable deployment, future work must establish clear standards for benchmarking, validation, and risk management [104]. Standardization efforts should aim to define consistent evaluation metrics and community benchmarks [105]. At the governance level, key priorities include ensuring algorithmic transparency, responsible data use, and stakeholder accountability throughout the agent lifecycle [106]. Regulatory frameworks must evolve to address the dynamic nature of AI systems, adopting mechanisms such as continuous monitoring, adaptive certification, and risk-sensitive regulation tailored to specific use cases and deployment environments [89].
(4)
Expanding and Deepening Multimodal Capabilities: As biomedical data becomes increasingly diverse and multimodal, agents must be equipped to interpret and reason over heterogeneous inputs, including clinical notes, imaging, genomics, physiological signals, and patient-generated data [51]. Future progress depends on the development of advanced multimodal fusion architectures that can model intermodal dependencies and contextual relationships more effectively. There is also a pressing need to integrate emerging modalities such as wearable sensor data, proteomics, and behavioral analytics into unified reasoning frameworks. Importantly, improving the interpretability and traceability of multimodal reasoning processes will be key to enabling clinical trust and accountability.
(5)
Human–AI Collaboration and Interaction: Rather than functioning as replacements for human professionals, future biomedical LLM agents should serve as collaborative partners that augment human expertise [13]. This calls for the design of more natural and efficient interaction interfaces, enabling users to guide, correct, and query agent behavior in real time. Collaborative decision-making protocols must be developed to support shared agency and role clarity between humans and machines. Furthermore, cultivating user trust will require increased transparency, reliability, and user control over agent outputs, especially in high-risk or legally sensitive domains.
(6)
Toward Trustworthy AI: Ultimately, the success of biomedical LLM agents will hinge on their alignment with both technical standards and broader societal values [107]. Building truly trustworthy systems requires integration across all preceding fronts, from robust model architectures and continual learning protocols to ethical design, regulatory compliance, and meaningful stakeholder engagement. These agents must not only function reliably in technical terms but also earn the confidence of clinicians, patients, and regulatory bodies.
To achieve this, agent development should incorporate fairness, accountability, transparency, and safety as core design principles from the outset. Human oversight must be embedded both during the training and deployment phases, enabling clinicians to monitor, guide, and correct agent behavior. Governance frameworks should ensure algorithmic transparency, clear auditability of decisions, and safeguards against misuse or systemic bias.
In parallel, evaluation paradigms must evolve beyond static and task-isolated benchmarks to reflect the dynamic, interactive, and high-stakes nature of real-world biomedical applications. Trustworthiness should be assessed not only by accuracy but also through multifaceted criteria such as hallucination detection, citation fidelity, bias quantification, robustness under uncertainty, and clinical safety. Longitudinal, scenario-based, and simulation-driven benchmarks grounded in realistic agent workflows will be crucial to capturing these aspects comprehensively.
Standardizing these evaluation criteria, along with establishing open, reproducible, and community-endorsed benchmarking frameworks, is essential to enable transparent model comparison, support regulatory validation, and foster public trust. In this light, advancing evaluation methodology is not merely a technical task but a foundational requirement for building socially acceptable and ethically aligned biomedical LLM agents. The path forward will require sustained interdisciplinary collaboration among technologists, clinicians, ethicists, and policymakers to ensure that future agents are not only effective but also accountable, equitable, and human-centered.
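As a concrete illustration of direction (4), the sketch below shows one common multimodal fusion pattern: gated fusion of text and image embeddings. It is a minimal example under stated assumptions; the dimensions are toy values and the linear projections are randomly initialized, whereas a real biomedical agent would obtain the embeddings from pretrained clinical-text and imaging encoders.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse text and image embeddings through a learned per-feature gate."""
    def __init__(self, text_dim=768, image_dim=512, fused_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        # The gate decides, feature by feature, how much each modality contributes.
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, text_emb, image_emb):
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        g = self.gate(torch.cat([t, v], dim=-1))
        return g * t + (1.0 - g) * v  # convex combination per feature

fusion = GatedFusion()
text_emb = torch.randn(4, 768)   # stand-in for clinical-note embeddings
image_emb = torch.randn(4, 512)  # stand-in for radiology-image embeddings
print(fusion(text_emb, image_emb).shape)  # torch.Size([4, 256])
```

The gating design lets the model downweight a noisy or missing modality per feature, which is one simple way to handle the sparse and conflicting signals common in clinical data.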
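For direction (5), one lightweight interaction pattern is an approval gate that routes higher-risk agent actions through a human reviewer before execution. The sketch below is illustrative only: the risk tiers, the `AgentAction` type, and the `execute`/`review` callables are assumptions made for this example, not an interface defined by any system cited in this review.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentAction:
    description: str
    risk: str  # "low", "medium", or "high", assigned by an upstream risk policy

def run_with_oversight(action: AgentAction,
                       execute: Callable[[AgentAction], str],
                       review: Callable[[AgentAction], bool]) -> str:
    """Run low-risk actions autonomously; require human sign-off otherwise."""
    if action.risk == "low":
        return execute(action)
    if review(action):  # blocking human review for medium/high-risk actions
        return execute(action)
    return f"BLOCKED by reviewer: {action.description}"

# A console prompt stands in for a real clinician-facing review interface.
approve = lambda a: input(f"Approve '{a.description}'? [y/N] ").strip().lower() == "y"
execute_fn = lambda a: f"executed: {a.description}"

print(run_with_oversight(AgentAction("order routine CBC panel", "low"), execute_fn, approve))
print(run_with_oversight(AgentAction("adjust warfarin dose", "high"), execute_fn, approve))
```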
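Direction (6) argues for multifaceted trust assessment. A toy aggregation is sketched below; the facet names echo the criteria discussed above, but the weights and the linear weighting scheme are illustrative assumptions rather than an established standard, and any deployed scoring rubric would need clinical validation.

```python
# Facet weights are illustrative assumptions; facets flagged as "inverted"
# are error rates, so lower raw values mean better behavior.
FACET_WEIGHTS = {
    "task_accuracy": 0.30,
    "hallucination_rate": 0.25,
    "citation_fidelity": 0.20,
    "bias_disparity": 0.15,
    "robustness_under_noise": 0.10,
}
INVERTED = {"hallucination_rate", "bias_disparity"}

def trust_score(facets: dict) -> float:
    """Weighted aggregate of per-facet scores in [0, 1]; higher is better."""
    total = 0.0
    for name, weight in FACET_WEIGHTS.items():
        value = facets[name]
        if name in INVERTED:
            value = 1.0 - value  # convert an error rate into a score
        total += weight * value
    return total

# Example: strong accuracy, but a 12% hallucination rate drags the score down.
print(round(trust_score({
    "task_accuracy": 0.91, "hallucination_rate": 0.12,
    "citation_fidelity": 0.84, "bias_disparity": 0.07,
    "robustness_under_noise": 0.78,
}), 3))
```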

6.5. Future Research Roadmap and Strategic Priorities

Beyond these specific research priorities, the next five years should also focus on establishing an integrative scientific foundation that unifies methodological innovation with translational impact. Our analysis reveals that true progress in biomedical LLM agents will depend on developing meta-level intelligence systems capable not only of performing reasoning and tool orchestration, but also of understanding when and how to apply them. This requires embedding adaptive self-regulation, uncertainty quantification, and epistemic awareness into agent architectures, enabling models to recognize the limits of their competence and seek human or algorithmic validation accordingly. Moreover, the convergence of federated learning, multimodal integration, and human AI collaboration offers an unprecedented opportunity to construct continuously learning ecosystems, where insights derived from distributed institutions can be aggregated securely to refine collective biomedical knowledge. Achieving this vision will demand not just technical sophistication but a co-evolution of governance, evaluation, and design ethics where trustworthiness becomes an operational property of the system rather than a post hoc assessment. Ultimately, advancing toward this paradigm will redefine biomedical LLM agents from passive assistants to self-reflective scientific collaborators that augment discovery, decision-making, and clinical reasoning in an accountable and interpretable manner.
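As a minimal sketch of the epistemic-awareness idea above, the snippet below estimates confidence by self-consistency (agreement among repeated samples) and defers to a human when agreement falls below a threshold. The `sample_answers` stub and the 0.8 cutoff are assumptions standing in for actual LLM sampling and a properly calibrated threshold.

```python
from collections import Counter
import random

def sample_answers(question: str, n: int = 7) -> list:
    # Stub: in practice, draw n completions from the LLM at temperature > 0.
    return random.choices(["diagnosis A", "diagnosis B"], weights=[5, 2], k=n)

def answer_or_defer(question: str, threshold: float = 0.8) -> str:
    """Answer only when sampled responses agree; otherwise escalate."""
    answers = sample_answers(question)
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)  # crude proxy for epistemic confidence
    if agreement >= threshold:
        return top
    return f"DEFER to human review (agreement {agreement:.2f} < {threshold})"

print(answer_or_defer("Most likely diagnosis for the vignette?"))
```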

7. Conclusions

Biomedical LLM agent systems that integrate language models with capabilities such as planning, memory, tool use, and interaction are transforming biomedical AI from passive information processing to active, goal-driven reasoning and decision-making. Recent advances span both single-agent architectures, which enhance internal mechanisms like self-reflection and long-range planning, and multi-agent frameworks that simulate collaborative clinical workflows. Techniques such as RAG, domain-adaptive fine-tuning, tool orchestration, and multimodal integration are increasingly applied in combination, forming hybrid, specialized agents tailored to biomedical tasks. Yet significant challenges remain: current evaluation benchmarks insufficiently capture the complexity and dynamism of real-world applications; hallucinations, limited interpretability, and data bias raise concerns about safety, equity, and trust; and privacy protection, tool reliability, and regulatory alignment pose further deployment barriers. To address these gaps, future efforts must focus on enhancing reasoning capabilities, enabling continual knowledge updating, standardizing evaluation protocols, deepening multimodal processing, and developing more intuitive human–AI interfaces. Ultimately, the goal is to build trustworthy biomedical agents that are not only technically robust but also clinically meaningful, ethically aligned, and practically deployable. Achieving this vision will require sustained innovation and close collaboration across disciplines, bringing together AI researchers, clinicians, ethicists, and policymakers to co-design intelligent systems that can safely and effectively advance biomedical research and healthcare delivery.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, Y.; Chen, X.; Jin, B.; Wang, S.; Ji, S.; Wang, W.; Han, J. A comprehensive survey of scientific large language models and their applications in scientific discovery. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8783–8817. [Google Scholar]
  2. Liu, L.; Yang, X.; Lei, J.; Liu, X.; Shen, Y.; Zhang, Z.; Wei, P.; Gu, J.; Chu, Z.; Qin, Z.; et al. A survey on medical large language models: Technology, application, trustworthiness, and future directions. arXiv 2024, arXiv:2406.03712. [Google Scholar] [CrossRef]
  3. Gao, S.; Fang, A.; Huang, Y.; Giunchiglia, V.; Noori, A.; Schwarz, J.R.; Ektefaie, Y.; Kondic, J.; Zitnik, M. Empowering biomedical discovery with ai agents. Cell 2024, 187, 6125–6151. [Google Scholar] [CrossRef]
  4. Goodell, A.J.; Chu, S.N.; Rouholiman, D.; Chu, L.F. Large language model agents can use tools to perform clinical calculations. npj Digit. Med. 2025, 8, 163. [Google Scholar] [CrossRef]
  5. Wu, K.; Wu, E.; Wei, K.; Zhang, A.; Casasola, A.; Nguyen, T.; Riantawan, S.; Shi, P.; Ho, D.; Zou, J. An automated framework for assessing how well llms cite relevant medical references. Nat. Commun. 2025, 16, 3615. [Google Scholar] [CrossRef]
  6. Zhu, Y.; Wei, S.; Wang, X.; Xue, K.; Zhang, S.; Zhang, X. MeNTi: Bridging medical calculator and LLM agent with nested tool calling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 5097–5116. [Google Scholar]
  7. Wu, X.; Zhao, Y.; Zhang, Y.; Wu, J.; Zhu, Z.; Zhang, Y.; Ouyang, Y.; Zhang, Z.; Wang, H.; Yang, J.; et al. Medjourney: Benchmark and evaluation of large language models over patient clinical journey. Adv. Neural Inf. Process. Syst. 2024, 37, 87621–87646. [Google Scholar]
  8. Schmidgall, S.; Ziaei, R.; Harris, C.; Reis, E.; Jopling, J.; Moor, M. Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv 2024, arXiv:2405.07960. [Google Scholar] [CrossRef]
  9. Huang, K.; Zhang, S.; Wang, H.; Qu, Y.; Lu, Y.; Roohani, Y.; Li, R.; Qiu, L.; Zhang, J.; Di, Y.; et al. Biomni: A general-purpose biomedical ai agent. bioRxiv 2025. [Google Scholar] [CrossRef] [PubMed]
  10. Fan, Y.; Xue, K.; Li, Z.; Zhang, X.; Ruan, T. An llm-based framework for biomedical terminology normalization in social media via multi-agent collaboration. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 10712–10726. [Google Scholar]
  11. Luo, Y.; Shi, L.; Li, Y.; Zhuang, A.; Gong, Y.; Liu, L.; Lin, C. From intention to implementation: Automating biomedical research via LLMs. Sci. China Inf. Sci. 2025, 68, 170105. [Google Scholar] [CrossRef]
  12. Qin, H.; Tong, Y. Opportunities and challenges for large language models in primary health care. J. Prim. Care Community Health 2025, 16, 21501319241312571. [Google Scholar]
  13. Wang, W.; Ma, Z.; Wang, Z.; Wu, C.; Chen, W.; Li, X.; Yuan, Y. A survey of llm-based agents in medicine: How far are we from baymax? arXiv 2025, arXiv:2502.11211. [Google Scholar] [CrossRef]
  14. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. Biobert: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  15. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 2021, 3, 1–23. [Google Scholar] [CrossRef]
  16. Swanson, D.R. Fish oil, raynaud’s syndrome, and undiscovered public knowledge. Perspect. Biol. Med. 1986, 30, 7–18. [Google Scholar] [CrossRef]
  17. Swanson, D.R. Migraine and magnesium: Eleven neglected connections. Perspect. Biol. Med. 1988, 31, 526–557. [Google Scholar] [CrossRef]
  18. Gilardi, F.; Alizadeh, M.; Kubli, M. Chatgpt outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 2023, 120, e2305016120. [Google Scholar] [CrossRef]
  19. Abbasian, M.; Azimi, I.; Rahmani, A.M.; Jain, R. Conversational health agents: A personalized llm-powered agent framework. arXiv 2023, arXiv:2310.02374. [Google Scholar] [CrossRef]
  20. Ramos, M.C.; Collison, C.J.; White, A.D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 2025, 16, 2514–2572. [Google Scholar] [CrossRef]
  21. Tang, X.; Zou, A.; Zhang, Z.; Li, Z.; Zhao, Y.; Zhang, X.; Cohan, A.; Gerstein, M. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv 2024, arXiv:2311.10537. [Google Scholar]
  22. Chen, X.; Yi, H.; You, M.; Liu, W.; Wang, L.; Li, H.; Zhang, X.; Guo, Y.; Fan, L.; Chen, G.; et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit. Med. 2025, 8, 159. [Google Scholar]
  23. Zuo, K.; Jiang, Y.; Mo, F.; Lio, P. Kg4diagnosis: A hierarchical multi-agent llm framework with knowledge graph enhancement for medical diagnosis. In Proceedings of the AAAI Bridge Program on AI for Medicine and Healthcare. PMLR, Philadelphia, PA, USA, 25 February 2025; pp. 195–204. [Google Scholar]
  24. Yue, L.; Xing, S.; Chen, J.; Fu, T. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Shenzhen, China, 22–25 November 2024; pp. 1–10. [Google Scholar]
  25. Huang, K.; Qu, Y.; Cousins, H.; Johnson, W.A.; Yin, D.; Shah, M.; Zhou, D.; Altman, R.; Wang, M.; Cong, L. Crispr-gpt: An llm agent for automated design of gene-editing experiments. arXiv 2024, arXiv:2404.18021. [Google Scholar]
  26. Roohani, Y.H.; Vora, J.; Huang, Q.; Liang, P.; Leskovec, J. BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments. arXiv 2024, arXiv:2405.17631. [Google Scholar]
  27. Das, R.; Maheswari, K.; Siddiqui, S.; Arora, N.; Paul, A.; Nanshi, J.; Udbalkar, V.; Sarvade, A.; Chaturvedi, H.; Shvartsman, T.; et al. Improved precision oncology question-answering using agentic llm. medRxiv 2024. [Google Scholar] [CrossRef]
  28. Hong, S.; Xiao, L.; Zhang, X.; Chen, J. Argmed-agents: Explainable clinical decision reasoning with llm discussion via argumentation schemes. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisboa, Portugal, 3–6 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5486–5493. [Google Scholar]
  29. Chan, T.K.; Dinh, N.-D. Entagents: Ai agents for complex knowledge otolaryngology. medRxiv 2025. [Google Scholar] [CrossRef]
  30. Luo, L.; Ning, J.; Zhao, Y.; Wang, Z.; Ding, Z.; Chen, P.; Fu, W.; Han, Q.; Xu, G.; Qiu, Y.; et al. Taiyi: A bilingual fine-tuned large language model for diverse biomedical tasks. J. Am. Med. Inform. Assoc. 2024, 31, 1865–1874. [Google Scholar] [CrossRef]
  31. Kim, S. Medbiolm: Optimizing medical and biological qa with fine-tuned large language models and retrieval-augmented generation. arXiv 2025, arXiv:2502.03004. [Google Scholar]
  32. Hsu, E.; Roberts, K. Llm-ie: A python package for biomedical generative information extraction with large language models. JAMIA Open 2025, 8, ooaf012. [Google Scholar] [CrossRef] [PubMed]
  33. Mondal, D.; Inamdar, A. Seqmate: A novel large language model pipeline for automating rna sequencing. arXiv 2024, arXiv:2407.03381. [Google Scholar]
  34. Yang, Z.; Qian, J.; Huang, Z.-A.; Tan, K.C. Qm-tot: A medical tree of thoughts reasoning framework for quantized model. arXiv 2025, arXiv:2504.12334. [Google Scholar]
  35. Luo, J.; Zhang, W.; Yuan, Y.; Zhao, Y.; Yang, J.; Gu, Y.; Wu, B.; Chen, B.; Qiao, Z.; Long, Q.; et al. Large language model agent: A survey on methodology, applications and challenges. arXiv 2025, arXiv:2503.21460. [Google Scholar] [CrossRef]
  36. Lu, Y.; Wang, J. Karma: Leveraging multi-agent llms for automated knowledge graph enrichment. arXiv 2025, arXiv:2502.06472. [Google Scholar] [CrossRef]
  37. Jia, Z.; Jia, M.; Duan, J.; Wang, J. Ddo: Dual-decision optimization via multi-agent collaboration for llm-based medical consultation. arXiv 2025, arXiv:2505.18630. [Google Scholar]
  38. Kim, Y. Healthcare Agents: Large Language Models in Health Prediction and Decision-Making. Ph.D. Dissertation, Massachusetts Institute of Technology, Cambridge, MA, USA, 2025. [Google Scholar]
  39. Kim, Y.; Park, C.; Jeong, H.; Chan, Y.S.; Xu, X.; McDuff, D.; Lee, H.; Ghassemi, M.; Breazeal, C.; Park, H.W. Mdagents: An adaptive collaboration of llms for medical decision-making. Adv. Neural Inf. Process. Syst. 2024, 37, 79410–79452. [Google Scholar]
  40. Xiao, M.; Cai, X.; Wang, C.; Zhou, Y. m-kailin: Knowledge-driven agentic scientific corpus distillation framework for biomedical large language models training. arXiv 2025, arXiv:2504.19565. [Google Scholar]
  41. Cheng, Y.; Zhang, C.; Zhang, Z.; Meng, X.; Hong, S.; Li, W.; Wang, Z.; Wang, Z.; Yin, F.; Zhao, J.; et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv 2024, arXiv:2401.03428. [Google Scholar] [CrossRef]
  42. Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med Inform. Assoc. 2025, 32, ocaf008. [Google Scholar] [CrossRef]
  43. Christophe, C.; Kanithi, P.; Munjal, P.; Raha, T.; Hayat, N.; Rajan, R.; Al-Mahrooqi, A.; Gupta, A.; Salman, M.U.; Pimentel, M.A.F.; et al. Med42: Evaluating fine-tuning strategies for medical LLMs: Full parameter vs. parameter-efficient approaches. arXiv 2024, arXiv:2404.14779. [Google Scholar]
  44. Song, S.; Xu, H.; Ma, J.; Li, S.; Peng, L.; Wan, Q.; Liu, X.; Yu, J. How to complete domain tuning while keeping general ability in llm: Adaptive layer-wise and element-wise regularization. arXiv 2025, arXiv:2501.13669. [Google Scholar]
  45. Khandekar, N.; Jin, Q.; Xiong, G.; Dunn, S.; Applebaum, S.; Anwar, Z.; Sarfo-Gyamfi, M.; Safranek, C.; Anwar, A.; Zhang, A.; et al. Medcalc-bench: Evaluating large language models for medical calculations. Adv. Neural Inf. Process. Syst. 2024, 37, 84730–84745. [Google Scholar]
  46. Liao, Y.; Jiang, S.; Wang, Y.; Wang, Y. Reflectool: Towards reflection-aware tool-augmented clinical agents. arXiv 2024, arXiv:2410.17657. [Google Scholar]
  47. He, Y.; Li, A.; Liu, B.; Yao, Z.; He, Y. Medorch: Medical diagnosis with tool-augmented reasoning agents for flexible extensibility. arXiv 2025, arXiv:2506.00235. [Google Scholar]
  48. Li, B.; Yan, T.; Pan, Y.; Luo, J.; Ji, R.; Ding, J.; Xu, Z.; Liu, S.; Dong, H.; Lin, Z.; et al. MMedAgent: Learning to use medical tools with multi-modal agent. arXiv 2024, arXiv:2407.02483. [Google Scholar]
  49. Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Zhang, J.; Wu, H.; Zhu, Y.; Ho, J.; Yang, C.; Wang, M.D. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; p. 22315. [Google Scholar]
  50. Choi, J.; Palumbo, N.; Chalasani, P.; Engelhard, M.M.; Jha, S.; Kumar, A.; Page, D. MALADE: Orchestration of LLM-powered agents with retrieval augmented generation for pharmacovigilance. arXiv 2024, arXiv:2408.01869. [Google Scholar] [CrossRef]
  51. Niu, Q.; Chen, K.; Li, M.; Feng, P.; Bi, Z.; Yan, L.K.; Zhang, Y.; Yin, C.H.; Fei, C.; Liu, J.; et al. From text to multimodality: Exploring the evolution and impact of large language models in medical practice. arXiv 2024, arXiv:2410.01812. [Google Scholar]
  52. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 2023, 36, 28541–28564. [Google Scholar]
  53. Liu, T.; Xiao, Y.; Luo, X.; Xu, H.; Zheng, W.J.; Zhao, H. Geneverse: A collection of open-source multimodal large language models for genomic and proteomic research. arXiv 2024, arXiv:2406.15534. [Google Scholar] [CrossRef]
  54. Liu, P.; Bansal, S.; Dinh, J.; Pawar, A.; Satishkumar, R.; Desai, S.; Gupta, N.; Wang, X.; Hu, S. Medchat: A multi-agent framework for multimodal diagnosis with large language models. arXiv 2025, arXiv:2506.07400. [Google Scholar]
  55. Ferber, D.; El Nahhas, O.S.; Wölflein, G.; Wiest, I.C.; Clusmann, J.; Leßmann, M.-E.; Foersch, S.; Lammert, J.; Tschochohei, M.; Jäger, D.; et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nat. Cancer 2025, 6, 1337–1349. [Google Scholar] [CrossRef]
  56. Gao, S.; Zhu, R.; Kong, Z.; Noori, A.; Su, X.; Ginder, C.; Tsiligkaridis, T.; Zitnik, M. Txagent: An ai agent for therapeutic reasoning across a universe of tools. arXiv 2025, arXiv:2503.10970. [Google Scholar] [CrossRef]
  57. Liu, P.; Ren, Y.; Tao, J.; Ren, Z. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Comput. Biol. Med. 2024, 171, 108073. [Google Scholar] [CrossRef] [PubMed]
  58. Liu, S.; Lu, Y.; Chen, S.; Hu, X.; Zhao, J.; Fu, T.; Zhao, Y. DrugAgent: Automating AI-aided drug discovery programming through LLM multi-agent collaboration. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle, Philadelphia, PA, USA, 4 March 2025. [Google Scholar]
  59. Xu, R.; Zhuang, Y.; Zhong, Y.; Yu, Y.; Tang, X.; Wu, H.; Wang, M.D.; Ruan, P.; Yang, D.; Wang, T.; et al. Medagentgym: Training llm agents for code-based medical reasoning at scale. arXiv 2025, arXiv:2506.04405. [Google Scholar]
  60. Wang, Z.; Zhu, Y.; Zhao, H.; Zheng, X.; Sui, D.; Wang, T.; Tang, W.; Wang, Y.; Harrison, E.; Pan, C.; et al. Colacare: Enhancing electronic health record modeling through large language model-driven multi-agent collaboration. Proc. ACM Web Conf. 2025, 2025, 2250–2261. [Google Scholar]
  61. Yi, Z.; Xiao, T.; Albert, M.V. A multimodal multi-agent framework for radiology report generation. arXiv 2025, arXiv:2505.09787. [Google Scholar] [CrossRef]
  62. Almansoori, M.; Kumar, K.; Cholakkal, H. Self-evolving multi-agent simulations for realistic clinical interactions. arXiv 2025, arXiv:2503.22678. [Google Scholar]
  63. Jiang, Y.; Black, K.C.; Geng, G.; Park, D.; Ng, A.Y.; Chen, J.H. Medagentbench: Dataset for benchmarking llms as agents in medical applications. arXiv 2025, arXiv:2501.14654. [Google Scholar]
  64. Mehandru, N.; Miao, B.Y.; Almaraz, E.R.; Sushil, M.; Butte, A.J.; Alaa, A. Evaluating large language models as agents in the clinic. NPJ Digit. Med. 2024, 7, 84. [Google Scholar] [CrossRef]
  65. Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV. PhysioNet 2020. Available online: https://physionet.org/content/mimiciv/1.0/ (accessed on 23 August 2021).
  66. Song, K.; Trotter, A.; Chen, J.Y. Llm agent swarm for hypothesis-driven drug discovery. arXiv 2025, arXiv:2504.17967. [Google Scholar] [CrossRef]
  67. Mehenni, G.; Zouaq, A. Medhal: An evaluation dataset for medical hallucination detection. arXiv 2025, arXiv:2504.08596. [Google Scholar] [CrossRef]
  68. Mitchener, L.; Laurent, J.M.; Tenmann, B.; Narayanan, S.; Wellawatte, G.P.; White, A.; Sani, L.; Rodriques, S.G. Bixbench: A comprehensive benchmark for llm-based agents in computational biology. arXiv 2025, arXiv:2503.00096. [Google Scholar]
  69. Fan, Z.; Wei, L.; Tang, J.; Chen, W.; Wang, S.; Wei, Z.; Huang, F. AI Hospital: Benchmarking large language models in a multi-agent medical interaction simulator. arXiv 2025, arXiv:2402.09742. [Google Scholar]
  70. Altermatt, F.R.; Neyem, A.; Sumonte, N.; Mendoza, M.; Villagran, I.; Lacassie, H.J. Performance of single-agent and multi-agent language models in spanish language medical competency exams. BMC Med. Educ. 2025, 25, 666. [Google Scholar] [CrossRef]
  71. Zhao, Y.; Wang, H.; Zheng, Y.; Wu, X. A layered debating multi-agent system for similar disease diagnosis. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 539–549. [Google Scholar]
  72. Tu, T.; Schaekermann, M.; Palepu, A.; Saab, K.; Freyberg, J.; Tanno, R.; Wang, A.; Li, B.; Amin, M.; Cheng, Y.; et al. Towards conversational diagnostic artificial intelligence. Nature 2025, 642, 442–450. [Google Scholar] [CrossRef]
  73. Garcia-Fernandez, C.; Felipe, L.; Shotande, M.; Zitu, M.; Tripathi, A.; Rasool, G.; Naqa, I.E.; Rudrapatna, V.; Valdes, G. Trustworthy ai for medicine: Continuous hallucination detection and elimination with check. arXiv 2025, arXiv:2506.11129. [Google Scholar]
  74. Bunnell, D.J.; Bondy, M.J.; Fromtling, L.M.; Ludeman, E.; Gourab, K. Bridging ai and healthcare: A scoping review of retrieval-augmented generation—Ethics, bias, transparency, improvements, and applications. medRxiv 2025. [Google Scholar] [CrossRef]
  75. Rani, M.; Mishra, B.K.; Thakker, D.; Babar, M.; Jones, W.; Din, A. Biases and trustworthiness challenges with mitigation strategies for large language models in healthcare. In Proceedings of the 2024 International Conference on IT and Industrial Technologies (ICIT), Chiniot, Pakistan, 10–12 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  76. Chen, J.; Gui, C.; Gao, A.; Ji, K.; Wang, X.; Wan, X.; Wang, B. Cod, towards an interpretable medical agent using chain of diagnosis. arXiv 2024, arXiv:2407.13301. [Google Scholar] [CrossRef]
  77. Goktas, P.; Grzybowski, A. Shaping the future of healthcare: Ethical clinical challenges and pathways to trustworthy ai. J. Clin. Med. 2025, 14, 1605. [Google Scholar] [CrossRef]
  78. Jin, Q.; Wang, Z.; Yang, Y.; Zhu, Q.; Wright, D.; Huang, T.; Wilbur, W.J.; He, Z.; Taylor, A.; Chen, Q.; et al. Agentmd: Empowering language agents for risk prediction with large-scale clinical tool learning. arXiv 2024, arXiv:2402.13225. [Google Scholar]
  79. Comeau, D.S.; Bitterman, D.S.; Celi, L.A. Preventing unrestricted and unmonitored ai experimentation in healthcare through transparency and accountability. npj Digit. Med. 2025, 8, 42. [Google Scholar] [CrossRef]
  80. Li, Y.C.; Wang, L.; Law, J.N.; Murali, T.; Pandey, G. Integrating multimodal data through interpretable heterogeneous ensembles. Bioinform. Adv. 2022, 2, vbac065. [Google Scholar] [CrossRef]
  81. AlSaad, R.; Alrazaq, A.A.; Boughorbel, S.; Ahmed, A.; Renault, M.A.; Damseh, R.; Sheikh, J. Multimodal large language models in health care: Applications, challenges, and future outlook. J. Med. Internet Res. 2024, 26, e59505. [Google Scholar] [CrossRef]
  82. Nweke, I.P.; Ogadah, C.O.; Koshechkin, K.; Oluwasegun, P.M. Multi-agent ai systems in healthcare: A systematic review enhancing clinical decision-making. Asian J. Med. Princ. Clin. Pract. 2025, 8, 273–285. [Google Scholar] [CrossRef]
  83. Pandey, H.G.; Amod, A.; Kumar, S. Advancing healthcare automation: Multi-agent system for medical necessity justification. arXiv 2024, arXiv:2404.17977. [Google Scholar] [CrossRef]
  84. Wei, H.; Qiu, J.; Yu, H.; Yuan, W. Medco: Medical education copilots based on a multi-agent framework. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 119–135. [Google Scholar]
  85. Nov, O.; Aphinyanaphongs, Y.; Lui, Y.W.; Mann, D.; Porfiri, M.; Riedl, M.; Rizzo, J.-R.; Wiesenfeld, B. The transformation of patient-clinician relationships with ai-based medical advice. Commun. ACM 2021, 64, 46–48. [Google Scholar] [CrossRef]
  86. Chen, K.; Zhen, T.; Wang, H.; Liu, K.; Li, X.; Huo, J.; Yang, T.; Xu, J.; Dong, W.; Gao, Y. Medsentry: Understanding and mitigating safety risks in medical llm multi-agent systems. arXiv 2025, arXiv:2505.20824. [Google Scholar]
  87. Pham, T. Ethical and legal considerations in healthcare ai: Innovation and policy for safe and fair use. R. Soc. Open Sci. 2025, 12, 241873. [Google Scholar] [CrossRef]
  88. Cheong, B.C. Transparency and accountability in ai systems: Safeguarding wellbeing in the age of algorithmic decision-making. Front. Hum. Dyn. 2024, 6, 1421273. [Google Scholar] [CrossRef]
  89. Palaniappan, K.; Lin, E.Y.T.; Vogel, S. Global regulatory frameworks for the use of artificial intelligence (ai) in the healthcare services sector. Healthcare 2024, 12, 562. [Google Scholar] [CrossRef]
  90. Zhou, Y.; Zhang, P.; Song, M.; Zheng, A.; Lu, Y.; Liu, Z.; Chen, Y.; Xi, Z. Zodiac: A cardiologist-level llm framework for multi-agent diagnostics. arXiv 2024, arXiv:2410.02026. [Google Scholar]
  91. Swanson, K.; Wu, W.; Bulaong, N.L.; Pak, J.E.; Zou, J. The virtual lab: Ai agents design new SARS-CoV-2 nanobodies with experimental validation. bioRxiv 2024. [Google Scholar] [CrossRef]
  92. Cui, H.; Shen, Z.; Zhang, J.; Shao, H.; Qin, L.; Ho, J.C.; Yang, C. Llms-based few-shot disease predictions using ehr: A novel approach combining predictive agent reasoning and critical agent instruction. AMIA Annu. Symp. Proc. 2024, 2025, 319. [Google Scholar]
  93. Wang, Z.; Wu, J.; Cai, L.; Low, C.H.; Yang, X.; Li, Q.; Jin, Y. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. arXiv 2025, arXiv:2503.18968. [Google Scholar]
  94. Zhao, L.; Bai, J.; Bian, Z.; Chen, Q.; Li, Y.; Li, G.; He, M.; Yao, H.; Zhang, Z. Autonomous multi-modal llm agents for treatment planning in focused ultrasound ablation surgery. arXiv 2025, arXiv:2505.21418. [Google Scholar] [CrossRef]
  95. Moëll, B.; Aronsson, F.S.; Akbar, S. Medical reasoning in LLMs: An in-depth analysis of DeepSeek R1. Front. Artif. Intell. 2025, 8, 1616145. [Google Scholar] [CrossRef]
  96. Matsumoto, N.; Choi, H.; Moran, J.; Hernandez, M.E.; Venkatesan, M.; Li, X.; Chang, J.-H.; Wang, P.; Moore, J.H. Escargot: An ai agent leveraging large language models, dynamic graph of thoughts, and biomedical knowledge graphs for enhanced reasoning. Bioinformatics 2025, 41, btaf031. [Google Scholar] [CrossRef]
  97. Xu, W.; Luo, G.; Meng, W.; Zhai, X.; Zheng, K.; Wu, J.; Li, Y.; Xing, A.; Li, J.; Li, Z.; et al. Mragent: An llm-based automated agent for causal knowledge discovery in disease via mendelian randomization. Brief. Bioinform. 2025, 26, bbaf140. [Google Scholar] [CrossRef]
  98. Atf, Z.; Safavi-Naini, S.A.A.; Lewis, P.R.; Mahjoubfar, A.; Naderi, N.; Savage, T.R.; Soroush, A. The challenge of uncertainty quantification of large language models in medicine. arXiv 2025, arXiv:2504.05278. [Google Scholar]
  99. Zhuang, Y.; Jiang, W.; Zhang, J.; Yang, Z.; Zhou, J.T.; Zhang, C. Learning to be a doctor: Searching for effective medical agent architectures. arXiv 2025, arXiv:2504.11301. [Google Scholar] [CrossRef]
  100. Zheng, J.; Shi, C.; Cai, X.; Li, Q.; Zhang, D.; Li, C.; Yu, D.; Ma, Q. Lifelong learning of large language model based agents: A roadmap. arXiv 2025, arXiv:2501.07278. [Google Scholar] [CrossRef]
  101. Li, J.; Lai, Y.; Li, W.; Ren, J.; Zhang, M.; Kang, X.; Wang, S.; Li, P.; Zhang, Y.-Q.; Ma, W.; et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv 2024, arXiv:2405.02957. [Google Scholar] [CrossRef]
  102. Casaletto, J.A.; Foley, P.; Fernandez, M.; Sanders, L.M.; Scott, R.T.; Ranjan, S.; Jain, S.; Haynes, N.; Boerma, M.; Costes, S.V.; et al. Foundational architecture enabling federated learning for training space biomedical machine learning models between the international space station and earth. bioRxiv 2025. [Google Scholar] [CrossRef]
  103. Zhang, L.; Li, Y. Federated learning with layer skipping: Efficient training of large language models for healthcare nlp. arXiv 2025, arXiv:2504.10536. [Google Scholar] [CrossRef]
  104. Rosenthal, J.T.; Beecy, A.; Sabuncu, M.R. Rethinking clinical trials for medical ai with dynamic deployments of adaptive systems. npj Digit. Med. 2025, 8, 252. [Google Scholar] [CrossRef] [PubMed]
  105. Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. A systematic review of testing and evaluation of healthcare applications of large language models (llms). medRxiv 2024. [Google Scholar] [CrossRef]
  106. Solaiman, B.; Mekki, Y.M.; Qadir, J.; Ghaly, M.; Abdelkareem, M.; Al-Ansari, A. A “true lifecycle approach” towards governing healthcare ai with the gcc as a global governance model. npj Digit. Med. 2025, 8, 337. [Google Scholar] [CrossRef] [PubMed]
  107. Sankar, B.S.; Gilliland, D.; Rincon, J.; Hermjakob, H.; Yan, Y.; Adam, I.; Lemaster, G.; Wang, D.; Watson, K.; Bui, A.; et al. Building an ethical and trustworthy biomedical ai ecosystem for the translational and clinical integration of foundation models. Bioengineering 2024, 11, 984. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the literature search and screening process for biomedical LLM agent studies included in this review.
Figure 2. A conceptual landscape of biomedical LLM agents. The architecture spans from agentic core components to system architectures, key enabling techniques, and real-world applications.
Figure 3. Key challenges and mitigation strategies for biomedical LLM agents.
Figure 4. Comprehensive workflows and use cases of biomedical LLM agents.
Table 1. Representative Biomedical LLM Agents and Frameworks.

| Agent/Framework | Core LLM | Key Methods | Application | Ref. |
|---|---|---|---|---|
| MedAgents | GPT-4, GPT-3.5 | MAS, RAG (implicit), CoT | Medical reasoning | [21] |
| MAC Framework | GPT-4, GPT-3.5 | MAS (MDT simulation) | Rare disease diagnosis | [22] |
| KG4Diagnosis | GPT, MedPaLM | MAS, RAG, Tool Use | Diagnosis w/ KG | [23] |
| BioResearcher | GPT-4o | MAS, Literature | Automated research | [11] |
| CT-Agent | GPT-4 | MAS, ReAct, Tool Use | Clinical trial analysis | [24] |
| CRISPR-GPT | N/A | Tool Use, Planning | Gene editing design | [25] |
| BioDiscoveryAgent | Claude 3.5 | PubMed Tool, Critic Agent | Perturbation planning | [26] |
| GeneSilico Copilot | N/A | RAG, Retrieval Tools, ReAct | Oncology (breast cancer) | [27] |
| ArgMed-Agents | GPT-3.5, GPT-4 | MAS, Symbolic Reasoning | Explainable decision making | [28] |
| ENTAgents | N/A | MAS, RAG, Reflection | ENT QA system | [29] |
| Taiyi | Qwen-7B-base | Bilingual Fine-tuning | Biomed NLP tasks | [30] |
| MedBioLM | N/A | Fine-tuning, RAG | Biomedical QA | [31] |
| LLM-IE Agent | N/A | Prompt Editor, Tool Use | Biomedical IE (NER, RE) | [32] |
| SeqMate | gpt-3.5-turbo | BioTools, Planning | RNA-seq analysis | [33] |
| Clinical Calc Agent | LLaMa, GPT-4o | Code Tool, RAG, API | Clinical scoring tasks | [4] |
Note: MAS = Multi-Agent System; RAG = Retrieval-Augmented Generation; CoT = Chain-of-Thought; MDT = Multidisciplinary Team; API = Application Programming Interface; ENT = Ear, Nose, and Throat (otolaryngology); NLP = Natural Language Processing; IE = Information Extraction; NER = Named Entity Recognition; RE = Relation Extraction. “Core LLM” refers to the foundational model; actual implementations may vary or be further fine-tuned.
Table 2. Biomedical LLM Agent Evaluation Benchmarks.

| Benchmark | Focus | Modalities | Metrics | Ref. |
|---|---|---|---|---|
| AgentClinic | Clinical simulation | Dialogue, Image, EHR | Accuracy, Bias analysis | [8] |
| MedJourney | Patient journey QA | Dialogue, Text | Accuracy, BLEU, Recall | [7] |
| CalcQA | Tool-based execution | Text (cases) | Tool accuracy | [6] |
| MedAgentBench | FHIR interaction | Text + EHR | Task success rate | [63] |
| MedHal | Hallucination detection | Text (clinical, QA) | Accuracy, F1 score | [67] |
| SourceCheckup | Citation verification | Text, URL | Source accuracy | [5] |
| BixBench | Bioinformatics QA | Text + Bio data | Task accuracy | [68] |
Note: EHR = Electronic Health Record; FHIR = Fast Healthcare Interoperability Resources; QA = Question Answering; BLEU = Bilingual Evaluation Understudy (machine-translation metric); F1 = harmonic mean of precision and recall; URL = Uniform Resource Locator.
Table 3. Challenges–Mitigations–Metrics for Biomedical LLM Agents.

| Challenge | Impact | Current Mitigation | Limitations | Proposed Metric |
|---|---|---|---|---|
| Hallucinations and Factual Inaccuracies | Misdiagnoses, fabricated references, unverified outputs | RAG, tool-based computation, self-correction, prompt engineering | Residual factual errors, difficulty defining hallucination boundaries | Hallucination Trust Index (HTI) |
| Explainability and Transparency | Limited interpretability, clinician distrust | SHAP, LIME, CoT, CoD, argumentation, model cards, human-in-the-loop | Opaque internal logic, difficult for non-experts to audit outputs | — |
| Data Quality, Availability, and Bias | Biased predictions, reduced generalization | Dataset curation, bias audits, fairness-aware training | Limited real-world demographic coverage | Equity Alignment Score (EAS) |
| Tool Reliability and Integration | Calculation errors, API misuse | Prompt refinement, tool-specific evaluation, error handling | Interface mismatch, invocation confusion | Tool Execution Fidelity (TEF) |
| Multimodal Data Integration and Processing | Sparse or conflicting patient signals | Joint encoders, gated fusion, alignment protocols | Modality gaps, noisy inputs | Multimodal Alignment Consistency (MAC) |
| Multi-Agent Collaboration Complexity | Inter-agent misalignment, redundant roles | Role assignment, consensus mechanisms | Coordination latency, planning inconsistencies | Agent Coordination Latency (ACL) |
| Ethics, Privacy, Security, Regulation | Data misuse, legal ambiguity, safety concerns | Privacy-preserving computation, AI-specific governance | No agent-specific regulatory standard, unclear responsibility | AI Governance Readiness Index (AGRI) |
