TriAgent: An Adaptive Multi-Agent Architecture for Crisis Clinical Decision Support Under Incomplete Information

Ibrahim, Ahmed; AlSanousi, Ali; Serag, Ahmed

doi:10.3390/ai7060230

by Ahmed Ibrahim ¹, Ali AlSanousi ² and Ahmed Serag ¹,*

¹ AI Innovation Lab, Weill Cornell Medicine-Qatar, Doha P.O. Box 24144, Qatar
² Hamad Medical Corporation, Doha P.O. Box 3050, Qatar
* Author to whom correspondence should be addressed.

AI 2026, 7(6), 230; https://doi.org/10.3390/ai7060230

Submission received: 30 April 2026 / Revised: 12 June 2026 / Accepted: 15 June 2026 / Published: 18 June 2026

(This article belongs to the Special Issue Agentic AI for Healthcare: Reasoning, Safety, and Clinically Aligned Autonomous Systems)

Abstract

Agentic artificial intelligence (AI) offers new opportunities for intelligent clinical decision support, but deployment in emergency and crisis settings remains challenging because time-critical recommendations must often be generated under incomplete patient information and system constraints. Conventional clinical decision support systems rely on rule-based workflows that degrade when structured data are absent, while standalone language models lack coordination mechanisms to enforce mandatory safety checks. We present TriAgent, a multi-agent framework that unifies adaptive orchestration, iterative retrieval, embedded safety verification, and end-to-end auditability within a single crisis clinical decision support workflow. An Orchestrator Agent dynamically selects specialist modules for clinical assessment, retrieval, treatment planning, safety verification, and system coordination, with routing determined by model reasoning rather than fixed execution paths. A retrieval sub-agent performs iterative query refinement and relevance grading over 49,000 MIMIC-IV discharge notes, while medication-conflict screening and allergy-risk assessment are invoked in parallel only when clinically indicated. A Critique Agent reviews the full reasoning trace before recommendation finalization. In a retrospective evaluation on 1000 real emergency presentations under synthesized incomplete-information inputs, TriAgent achieved 85.0% critical-case recall and 65.7% overall triage accuracy, versus at most 14.7% and 43.4% for matched single-model and retrieval-only baselines, with safety checks executed on every continuation pathway and adaptive routing invoking only the modules each case required. These results support multi-agent orchestration as a promising design pattern for transparent and auditable AI in healthcare. These gains are internal system properties; clinical-safety benefit remains to be established through prospective, clinician-involved validation.

Keywords:

clinical decision support (CDS); agentic AI; multi-agent systems; retrieval-augmented generation (RAG); large language models (LLM); emergency medicine; drug-drug interaction; patient safety

1. Introduction

The integration of artificial intelligence (AI) into clinical workflows has created substantial opportunities for more responsive, efficient, and personalised medical care [1,2]. Yet despite major progress in medical informatics, a critical gap remains in emergency settings where time-sensitive decisions must be made under conditions of incomplete information [3,4]. When a patient presents with acute symptoms but no accessible longitudinal electronic health record, whether due to system unavailability, inter-facility transfer, or first-time presentation, clinicians face the compounded challenge of rapid assessment and safe intervention under uncertainty. This problem is particularly pronounced in emergency medicine, rural healthcare, mass casualty events, and cross-institutional transfers where electronic health record interoperability remains limited [5,6].

Consider a representative scenario in which a patient presents with limited accessible history beyond verbal report. Traditional workflows may require time-intensive chart review, specialist consultation, and cautious treatment delays while additional information is obtained. A system capable of retrieving analogous historical cases, proposing an initial treatment plan, screening for medication interactions, and assessing allergy risk could meaningfully accelerate decision-making while reducing the cognitive burden that contributes to error during high-pressure presentations [7,8].

However, existing clinical decision support systems (CDSS) remain limited in incomplete-information emergency settings. Rule-based systems often depend on completely structured inputs, while standalone safety databases typically operate without broader clinical contextualization. More broadly, many current CDSS lack integrated adaptive urgency reasoning, mandatory safety verification, retrieval of analogous cases, and auditable provenance within a unified workflow, despite these properties being central to trustworthy adoption in acute-care environments [9,10].

Large language models (LLMs) and retrieval-augmented generation (RAG) approaches have demonstrated strong understanding of clinical narratives and promising performance on medical reasoning tasks [11,12,13]. However, deploying LLMs without coordination introduces well-documented safety risks, including hallucinated treatments, overconfident recommendations, and the absence of mechanisms for enforcing mandatory safety checks [14,15]. These limitations are particularly consequential in emergency medicine, where an incorrect recommendation may directly harm a patient before it can be detected and corrected.

Recent agentic AI frameworks address these limitations by decomposing complex workflows into coordinated specialist module calls governed by a planning component [16,17,18,19]. Clinical reasoning is inherently multi-step, encompassing information gathering, urgency assessment, evidence retrieval, treatment planning, and safety verification, making it naturally suited to modular orchestration.

Combining agents with RAG extends this capability further. RAG grounds generation in retrieved evidence [20,21], while the orchestration layer adds dynamic routing, mandatory safety checkpoints, parallel verification, and conservative fallback strategies when confidence is low [22,23]. In crisis medicine, where a single presentation may require case retrieval, drug interaction screening, allergy assessment, policy checking, and treatment drafting in rapid succession, such orchestration is not merely beneficial but necessary.

Agent systems also extend the source-level traceability of RAG into a structured record of the decision-support process. While RAG systems can attribute generated text to retrieved documents, they do not necessarily capture why a retrieval was performed, what actions followed from it, or whether required safety checks were completed. In this work, we propose an agentic crisis CDS system in which every module invocation, retrieved case, safety check, and policy constraint is logged within a structured hand-off record, producing a decision record that supports clinical review, retrospective audit, and medicolegal accountability. This level of provenance directly addresses one of the major barriers to clinical adoption of AI-based decision support.

TriAgent’s contribution is architectural: the principled composition of established techniques (multi-agent coordination, retrieval augmentation, and structured critique) into a single crisis-oriented workflow defined by adaptive orchestration, structurally enforced safety, and end-to-end auditability. An Orchestrator Agent performs intake, validation, and triage, then adaptively selects and sequences specialist modules, routing critical presentations directly to an emergency pathway. Safety is enforced by the workflow’s structure rather than appended to it: medication, allergy, and policy checks are mandatory steps, and a Critique Agent reviews outputs for medical plausibility and safety consistency before release. Every module call generates a structured hand-off record, giving each decision end-to-end provenance for governance and retrospective review. Our evaluation measures these architectural properties; whether they translate into real-world clinical safety can only be answered by prospective, clinician-involved study.

2. Related Work

We situate the proposed system relative to prior work in CDSS, clinical retrieval-augmented generation, agentic AI, and medication safety systems.

Rule-based CDSS have demonstrated measurable benefits for process adherence, prescribing quality, and medication error reduction when clinical inputs are complete and well structured [24,25]. Their principal limitation is brittleness under missing data: when required fields are absent, rule conditions cannot trigger, often resulting in generic outputs or no recommendation at all. Machine learning CDSS improve robustness to noisy inputs and have shown strong predictive performance on retrospective electronic health record (EHR) data [3,4,26]. However, many such systems are developed for complete-record settings and do not directly address real-time triage under incomplete-information conditions.

RAG reduces hallucination and improves source attribution by grounding language model outputs in retrieved documents [20,21]. In clinical NLP, where the ability to extract structured information from unstructured narrative text underpins downstream decision support [27,28], RAG-style methods have shown promise for question answering, summarization, discharge documentation, and differential diagnosis support. Recent clinical RAG systems have demonstrated this potential in concrete settings, including EHR-grounded question answering, similarity search, and report summarization over MIMIC-IV records [29], and personalized cardiovascular risk assessment delivered through a retrieval-augmented chatbot interface [30]. Nevertheless, most current systems rely on single-pass retrieval. When a query is vague, incomplete, or semantically ambiguous, single-pass retrieval may return superficially similar cases with limited clinical relevance. General-domain work has begun to address this limitation through self-reflective retrieval, in which the model decides on-demand whether to retrieve and critiques its own generations [31]. Our approach extends this direction to clinical RAG via an iterative retrieval module that performs relevance grading and query reformulation before downstream generation proceeds.

Agentic AI frameworks have demonstrated the value of decomposing complex tasks into coordinated reasoning and tool-use steps. Chain-of-thought prompting showed that explicit step-by-step reasoning improves multi-step problem solving in language models [32]. Building on this, ReAct introduced interleaved reasoning traces with tool invocation, improving interpretability and controllability [18]. Subsequent approaches, such as Reflexion and Self-Refine, showed that iterative self-critique can improve outputs without parameter updates [22,23]. These developments are particularly relevant to medicine, where multi-step reasoning, verification, and conservative fallback behaviour are essential. Recent work has begun to instantiate this paradigm clinically through multi-agent collaboration frameworks in which role-specialized LLM agents discuss and refine answers to medical reasoning tasks [33]. Our system adopts this paradigm through an Orchestrator Agent that coordinates specialist modules and a downstream Critique Agent that verifies safety completeness, evidence grounding, and medical plausibility.

Medication safety remains a core requirement for deployable CDS. Dedicated interaction systems can detect drug-drug contraindications and provide management recommendations, but they are usually deployed as standalone utilities rather than embedded components of broader clinical reasoning workflows. The D3 Drug-Drug-Interaction classifier provides structured interaction prediction with explanatory outputs and management guidance, making it suitable for callable integration within a modular CDS pipeline [34]. Allergy risk assessment presents a related but distinct challenge, requiring reasoning over prior reactions, compound similarity, likely cross-reactivity, and safer alternatives. These tasks often exceed simple lookup-table logic and benefit from contextual reasoning.

These limitations motivate a unified crisis-oriented CDS architecture combining adaptive triage, iterative retrieval, embedded safety verification, emergency escalation, and end-to-end auditability.

3. Methodology

This section describes the proposed TriAgent system, including the dataset, system architecture, specialist components, operational infrastructure, and evaluation protocol.

3.1. Dataset

Developing and evaluating AI systems in healthcare is frequently constrained by restricted access to clinical data and stringent privacy regulations. In the present setting, evaluation requires a corpus spanning triage severity, medication profiles, and allergy configurations. No suitable public benchmark currently exists, as available resources such as MIMIC-IV primarily contain retrospective discharge documentation rather than incomplete real-time intake scenarios. We therefore constructed a purpose-built evaluation dataset by linking two real, de-identified MIMIC-IV resources: the MIMIC-IV-Note discharge corpus [35] and the MIMIC-IV-ED emergency-department module [36], which records, for each visit, the Emergency Severity Index (ESI) acuity assigned by the triage nurse, the measured triage vital signs, the chief complaint, and the arrival mode. The resulting cohort comprises 1000 cases balanced across the three urgency tiers (334 critical, 333 urgent, 333 routine).

The ground-truth urgency tier for each case is the ESI acuity assigned by the triage nurse at the time of presentation, mapped as ESI-1 → critical, ESI-2 → urgent, and ESI levels 3–5 → routine. The reference standard is therefore external to the system: it reflects a real clinical judgment made at the point of care rather than a rule applied to the text the system reads. ESI is nonetheless a human judgement with imperfect reliability; a meta-analysis of 19 studies reports pooled inter-rater agreement of $\kappa \approx 0.79$, with study-level values from 0.46 to 0.98 and nurse-versus-reference agreement near 0.73 [37], which bounds the agreement any automated system can achieve against this standard. Each case additionally carries its measured triage vitals, documented allergies, and home medications, which exercise the medication- and allergy-safety pathways.
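The label mapping stated above can be written as a small function. This is an illustrative sketch only; the function name `esi_to_tier` is hypothetical and is not taken from the released code.

```python
def esi_to_tier(esi_level: int) -> str:
    """Map an Emergency Severity Index level (1-5) to the three urgency tiers."""
    if esi_level == 1:
        return "critical"
    if esi_level == 2:
        return "urgent"
    if esi_level in (3, 4, 5):
        return "routine"
    # Anything else is not a valid ESI level.
    raise ValueError(f"ESI level must be 1-5, got {esi_level!r}")
```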

The only synthetic element is the model input. Each discharge note is converted into a fragmented, first-person patient narrative that suppresses explicit diagnoses and exact vital values, mimicking the incomplete descriptions typical of crisis presentations. This narrative is the sole free-text input the system receives at inference; the structured ESI label, vitals, and chief complaint serve only as ground truth or, where stated, as structured triage inputs. The underlying clinical data (ESI acuity, vitals, medications, and discharge notes) are therefore real; only the input narratives are synthesized, so the study is best characterized as a retrospective evaluation on real clinical data under synthesized incomplete-information inputs. The use of generated input narratives is consistent with prior evidence that synthetic clinical text can preserve key downstream characteristics [38]. The composition of the resulting evaluation cohort, from ground-truth ESI tier through routing pathway to documented safety profile, is summarized in Figure 1.

A separate retrieval index of 49,000 real discharge notes supports the specialist modules. The index is patient-disjoint from the evaluation cohort (zero subject-identifier overlap), so no evaluation case, nor any note from the same patient, can surface during similar-case retrieval.

3.2. System Overview

Figure 2 illustrates the end-to-end architecture of the proposed system. The system accepts patient input either as structured JSON or as free-text natural language. An intake parser converts the input into a typed case state containing symptoms, vital signs, medications, allergies, and available past medical history. This state is passed to the Orchestrator Agent, which dynamically selects and sequences specialist modules according to the evolving clinical context. The architecture emphasizes adaptivity, safety by design, and auditability through selective module invocation, embedded safety checks, and structured execution tracing.

The standard continuation pathway consists of urgency assessment, case retrieval, treatment drafting, safety verification, policy review, note compilation, and final critique. Critical presentations follow an emergency bypass pathway in which escalation is triggered immediately to preserve response time.

3.3. Orchestrator Agent

The Orchestrator Agent is the central control component of the system. It receives the parsed case state and dynamically selects from the available specialist modules. The agent is implemented using GPT-5-nano and is responsible for routing, sequencing, evidence gathering, treatment-planning coordination, safety-verification ordering, and termination decisions. GPT-5-nano was selected for the orchestration role because it supports controllable reasoning effort while offering low-latency execution suitable for routing, sequencing, and termination decisions within an iterative ReAct loop [39]. A single backbone serves the whole pipeline at two reasoning-effort levels, with temperature $= 0$ and a token budget of 8192. In development, lower reasoning settings sometimes caused repeated tool invocation until the iteration cap, supporting the use of medium effort for control-point decisions. Execution is capped at three complete pipeline passes, where one pass comprises the full tool loop followed by critique; beyond this cap the case is finalized from the accumulated state.

System behavior depends in part on prompt design: the orchestrator, urgency-assessment, critique, and safety-policy modules each use structured prompts that constrain output format and enforce mandatory steps. The source code, including all prompt templates, is publicly available at https://github.com/serag-ai/TriAgent (accessed on 12 June 2026).

Urgency assessment is always performed first. Emergency escalation overrides all other logic and exits the standard pathway immediately. Treatment drafting must be followed by appropriate safety checks before final recommendation generation. When confidence is low, the Orchestrator Agent may request additional evidence, invoke further modules, or attach conditional (if/then) contingency orders in place of a passive hold (Section 3.7). The continuation loop terminates once sufficient evidence has been gathered to generate a final recommendation.

Algorithm 1 presents the canonical execution path of the agentic orchestrator for a continuation case with both medications and documented allergies. The canonical tool invocation sequence is: urgency assessment → similarity search → draft planning → [drug screening ‖ allergy assessment] → safety verdict → note compilation, where ‖ denotes concurrent execution via ThreadPoolExecutor. The Orchestrator Agent is not constrained to this order beyond the mandatory initial urgency assessment; it reasons dynamically at each step and may invoke any tool zero or more times per execution.

Algorithm 1: Agentic orchestrator workflow (canonical continuation path)
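The following Python sketch illustrates the canonical sequence described in the text. It is not the authors' Algorithm 1 listing: every function is a hypothetical stub, the urgency rule is a toy placeholder, and the fixed ordering shown here simplifies the real orchestrator, which reasons dynamically at each step. Only the use of `ThreadPoolExecutor` for the concurrent safety checks is taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor

# All functions below are hypothetical stubs standing in for the specialist
# modules; the real modules call language models and external services.

def assess_urgency(case):
    # Toy rule for illustration only; the paper uses a trained triage head.
    return "critical" if case.get("spo2", 100) < 92 else "urgent"

def search_similar(case):
    return ["retrieved-case-1", "retrieved-case-2"]

def draft_plan(case, evidence):
    return {"actions": ["supportive care"], "evidence": evidence}

def screen_drugs(case, plan):
    return {"module": "drug_screening", "flags": []}

def assess_allergy(case, plan):
    return {"module": "allergy_assessment", "flags": []}

def safety_verdict(findings):
    flagged = any(f["flags"] for f in findings)
    return "approve_with_modification" if flagged else "approve"

def run_canonical_path(case):
    trace = ["urgency_assessment"]  # urgency assessment always runs first
    tier = assess_urgency(case)
    if tier == "critical":
        # Emergency escalation overrides the standard pathway.
        trace.append("emergency_escalation")
        return {"tier": tier, "trace": trace}

    evidence = search_similar(case)
    trace.append("similarity_search")
    plan = draft_plan(case, evidence)
    trace.append("draft_planning")

    # Invoke only the safety modules the case requires.
    checks = []
    if case.get("medications"):
        checks.append(screen_drugs)
    if case.get("allergies"):
        checks.append(assess_allergy)

    # Run the indicated checks concurrently; results keep submission order.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(check, case, plan) for check in checks]
        findings = [future.result() for future in futures]
    trace.extend(f["module"] for f in findings)

    verdict = safety_verdict(findings)
    trace.append("safety_verdict")
    trace.append("note_compilation")
    return {"tier": tier, "plan": plan, "verdict": verdict, "trace": trace}
```

Calling `run_canonical_path` on a case with both medications and allergies produces a trace that follows the canonical order, while a case meeting the toy critical rule stops after escalation.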

3.4. Composable Specialist Modules

The Orchestrator Agent invokes specialist modules grouped into three functional categories.

3.4.1. Reasoning and Planning Modules

The urgency assessment module classifies cases as critical, urgent, or routine. Because reasoning by the language model alone collapses toward the modal tier and anchors on the presenting complaint (Section 4.4), this module is built around a dedicated triage classifier based on gradient boosting (Section 3.5); the language model contributes structured features and the differential of diagnoses that must be excluded first, rather than the final tier. The similarity search module retrieves analogous cases from the 49,000 discharge notes, which are disjoint by patient from the evaluation cohort, using all-MiniLM-L6-v2 sentence embeddings (384 dimensions, normalized to unit length) over an exact FAISS IndexFlatIP index [40], which implements cosine similarity via inner product. For each query, the top $k = 5$ candidates are graded for relevance by the language model against an acceptance threshold of 0.5; queries falling below it trigger at most one rewrite before the best available results are passed downstream and flagged as low confidence. The draft plan module then generates candidate interventions grounded in this retrieved evidence while explicitly documenting uncertainty arising from incomplete information.
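The retrieve, grade, and rewrite loop can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: a dictionary of unit-length vectors stands in for the FAISS index, `embed`, `grade`, and `rewrite` are hypothetical callables supplied by the caller, and the grader is assumed to return one aggregate relevance score per query (the text does not say how per-candidate grades are combined).

```python
import math

def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, index, k=5):
    """Exact inner-product search over a dict of unit-length vectors."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]

def retrieve(query, embed, grade, rewrite, index,
             k=5, threshold=0.5, max_rewrites=1):
    """Retrieve, grade, and rewrite the query at most `max_rewrites` times."""
    best = None
    for attempt in range(max_rewrites + 1):
        hits = top_k(embed(query), index, k)
        relevance = grade(query, hits)  # assumed: one score in [0, 1]
        if best is None or relevance > best["relevance"]:
            best = {"query": query, "hits": hits, "relevance": relevance}
        if relevance >= threshold:
            return {**best, "low_confidence": False, "rewrites": attempt}
        if attempt < max_rewrites:
            query = rewrite(query)
    # Threshold never met: pass the best available results, flagged.
    return {**best, "low_confidence": True, "rewrites": max_rewrites}
```

Because the stored vectors and the query are both unit length, ranking by inner product is equivalent to ranking by cosine similarity, which is the property the text attributes to `IndexFlatIP`.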

We implemented this retrieval process using an agentic RAG framework [41], where autonomous reasoning agents dynamically refine searches, assess evidence quality, and coordinate information flow. This approach improves robustness, contextual relevance, and transparency, particularly in high-stakes scenarios with incomplete or ambiguous clinical data.

3.4.2. Safety and Knowledge Modules

The drug safety module uses the D3 Drug-Drug-Interaction classifier [34] to evaluate medication conflicts. The allergy risk module assesses likely cross-reactivity using contextual reasoning supported by real-world adverse reaction data retrieved from the FDA Adverse Event Reporting System [42]. The safety policy module integrates these findings with evidence-based clinical safety guidance [43] into a final verdict, determining whether treatment proceeds as drafted, proceeds with modifications under conditional (if/then) orders, or is rejected; the verdict vocabulary deliberately contains no hold disposition (Section 3.7).

3.4.3. Memory and Documentation Modules

The case-history recall module retrieves similar prior decisions and outcomes from in-session memory (the $k = 3$ nearest by case-state embedding). The clinical note module compiles the final structured recommendation, including approved actions, monitoring requirements, uncertainty disclosures, and evidence citations. The emergency escalation module bypasses the standard pathway and generates a structured escalation payload for critical presentations.

3.5. Triage Classification Head

The Emergency Severity Index is essentially threshold logic: danger-zone vitals such as $\mathrm{SpO}_2 < 92\%$ or $\mathrm{SBP} < 90$ mmHg act as discrete cliffs, which linear or purely generative classifiers capture poorly; in development, a tree-based model recovered roughly ten accuracy points over a linear one. The triage head is therefore a gradient-boosted decision tree over two inputs: tier probabilities from a TF–IDF logistic regression on the chief complaint, stacked out-of-sample, and the seven triage vitals plus arrival mode. At inference, the chief complaint is inferred from the lay narrative by the same lightweight language-model call that seeds retrieval.

The head is trained on 415,423 MIMIC-IV-ED triage encounters, patient-disjoint from the evaluation cases, using a three-way split for the text model, the tree, and threshold tuning. Missing vitals are left unimputed, since absent charting is itself predictive of high acuity (the sickest patients bypass routine charting). A single threshold on the critical-class probability ($\tau = 0.183$, tuned on a held-out split) assigns the critical tier and triggers an escalation recommendation to the orchestrator, providing an explicit dial between critical sensitivity and over-triage. The language model’s urgency judgment is retained only as the worst-case differential used for planning (Section 3.7), not as the final tier.
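The thresholding step can be illustrated with a short sketch. Two details are assumptions rather than statements from the text: the comparison is taken to be inclusive (probability at or above the threshold), and non-critical cases are assigned whichever of the remaining tiers has the higher probability. The function name `assign_tier` is hypothetical.

```python
TAU = 0.183  # critical-class probability threshold reported in the text

def assign_tier(probs, tau=TAU):
    """Return (tier, escalate) from per-tier probabilities.

    `probs` maps "critical", "urgent", and "routine" to probabilities.
    """
    if probs["critical"] >= tau:  # assumption: inclusive comparison
        return "critical", True
    # Assumption: the remaining tier is the more probable non-critical one.
    tier = max(("urgent", "routine"), key=lambda name: probs[name])
    return tier, False
```

Lowering `tau` trades more over-triage for higher critical sensitivity; a case with a critical probability of only 0.2 is still escalated even when another tier is more probable.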

3.6. Critique Agent

Before finalisation, non-emergency outputs are reviewed by a downstream Critique Agent, which examines the complete reasoning trace using four structured checks:

Safety path completeness. If treatment planning was executed, at least one safety tool and the safety policy module must also appear in the trace.
No blocked actions in note. Actions flagged as blocked by the drug conflict screener must not appear in the compiled clinical note.
Evidence retrieval for non-critical cases. If urgency is not critical, the evidence retrieval tool must have been called.
Medical plausibility. A secondary LLM (Qwen3.5-27B) checks proposed medications and actions for fabricated drug names or clinically implausible interventions.
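The first three checks are structural and can be expressed as deterministic tests over the trace. The sketch below assumes a trace represented as a list of tool names; the tool identifiers and the function `run_structural_checks` are hypothetical, and the fourth (LLM-based plausibility) check is omitted.

```python
# Hypothetical tool identifiers; the released code may name them differently.
SAFETY_TOOLS = {"drug_screening", "allergy_assessment"}

def run_structural_checks(trace, urgency, note_actions, blocked_actions):
    """Return the names of failed checks (an empty list means all passed)."""
    called = set(trace)
    failures = []

    # 1. Safety path completeness: planning requires at least one safety
    #    tool and the safety policy module in the trace.
    if "draft_planning" in called:
        if not (called & SAFETY_TOOLS) or "safety_policy" not in called:
            failures.append("safety_path_incomplete")

    # 2. No blocked actions in note.
    if set(note_actions) & set(blocked_actions):
        failures.append("blocked_action_in_note")

    # 3. Evidence retrieval for non-critical cases.
    if urgency != "critical" and "similarity_search" not in called:
        failures.append("evidence_retrieval_missing")

    return failures
```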

The plausibility check deliberately uses a model family different from the orchestrator’s, so that the system critiquing recommendations shares no foundation with the system producing them; this separation reduces self-confirmation bias. A Qwen model was chosen because the family is among the strongest open-weight performers on medical tasks: Qwen-based models fine-tuned for medical question answering have matched or exceeded larger proprietary models on licensing-exam benchmarks [44], consistent with the broader emergence of strong open-weight medical reasoners [45]. Using a stock model rather than a medical fine-tune keeps the critique layer simple and independently reproducible.

A structured critique report is appended to the audit trace. If any check fails and the orchestrator has not yet retried, a feedback message is injected into the conversation and the loop is re-entered for targeted remediation.

3.7. Safety-Oriented Mitigations

The execution pathway embeds three targeted, largely deterministic architectural safeguards intended to reduce the likelihood of known failure modes to which language-model-driven decision support is prone: anchoring on the presenting complaint [7], passive deferral of decisions, and unsupported or contraindicated medication suggestions [14]. Whether these safeguards improve patient outcomes can only be confirmed through prospective evaluation; their inclusion here is justified by their structural properties.

3.7.1. Worst-Case Differential Enumeration

To counter anchoring, the urgency-assessment step enumerates the can’t-miss diagnoses for every case. For each plausible serious diagnosis it records the triggering features, the test that would exclude it, whether it can be excluded immediately, and the tier it would imply if confirmed. The resulting rule-out list is passed to the planning module, which must open the plan with the corresponding emergent workup. This differential shapes the plan but has only bounded influence on the tier: a serious possibility that cannot yet be excluded raises acuity to at most urgent, never automatically to critical, because treating diagnostic uncertainty as high acuity is the dominant ESI over-triage error. Genuinely peri-arrest presentations still reach the critical tier through the classifier head and its highest-priority overrides.

3.7.2. Elimination of Passive Holds

The safety-policy vocabulary contains no hold-pending-clarification disposition: an emergency decision-support tool must act on incomplete data rather than suspend care. When data are missing, the verdict must instead approve with modification and attach explicit if/then conditional orders (for example, if $\mathrm{SBP} < 90$ mmHg, administer a fluid bolus and reassess), turning a passive hold into an actionable contingency plan. A deterministic post-processing step rewrites any residual hold.
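A deterministic rewrite of this kind might look like the sketch below. The disposition strings, the dictionary layout, and the function `rewrite_hold` are assumptions made for illustration; the sketch also assumes that any disposition outside the allowed vocabulary is treated as a residual hold.

```python
# Hypothetical verdict vocabulary with no hold disposition.
ALLOWED = {"approve", "approve_with_modification", "reject"}

def rewrite_hold(verdict, conditional_orders):
    """Replace a residual hold with an approval carrying if/then orders."""
    disposition = verdict["disposition"]
    if disposition in ALLOWED:
        return verdict  # nothing to rewrite
    # Assumption: anything outside the vocabulary is a residual hold.
    return {
        **verdict,
        "disposition": "approve_with_modification",
        "conditional_orders": list(conditional_orders),
        "rewritten_from": disposition,
    }
```

For example, a verdict of `hold_pending_clarification` would be rewritten to `approve_with_modification` with the order "if SBP < 90 mmHg, administer a fluid bolus and reassess" attached, and the original disposition is kept in `rewritten_from` so the change remains visible in the audit trace.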

3.7.3. Deterministic Medication Guardrails

A rule-based contraindication screen, independent of the language model, excludes unsafe combinations of drug and clinical context, keyed on the presenting features and the still-open differential: anti-motility agents in possible inflammatory or infectious diarrhea, loop diuretics before a pulmonary embolism has been excluded, and empirical antibiotics without an identified source. The planning prompt additionally requires every proposed drug to be explicitly indicated for the case at hand. The guard fires selectively, intervening only when a risky agent is actually proposed.
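The three rules named above can be encoded as a small lookup keyed on drug class and clinical context. The sketch is illustrative only: the class labels, context flags, and the function `screen_plan` are hypothetical, and the actual rule set and matching logic may differ.

```python
# Each rule pairs a drug class with the context flag that blocks it.
# Labels are hypothetical encodings of the three rules named in the text.
RULES = [
    ("anti_motility", "possible_inflammatory_or_infectious_diarrhea"),
    ("loop_diuretic", "pulmonary_embolism_not_excluded"),
    ("empirical_antibiotic", "no_identified_source"),
]

def screen_plan(proposed_drugs, context_flags):
    """Split proposed drugs into allowed and blocked lists.

    `proposed_drugs` is a list of dicts with "name" and "drug_class";
    `context_flags` is a set of flags derived from the presenting
    features and the still-open differential.
    """
    allowed, blocked = [], []
    for drug in proposed_drugs:
        reasons = [flag for drug_class, flag in RULES
                   if drug["drug_class"] == drug_class and flag in context_flags]
        if reasons:
            blocked.append({"name": drug["name"], "reasons": reasons})
        else:
            allowed.append(drug["name"])
    return allowed, blocked
```

The screen intervenes only when a proposed drug matches a rule in the current context; a plan containing no risky agent passes through unchanged.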

3.8. Operational Execution

Each module invocation produces a structured hand-off record containing inputs, outputs, confidence estimates, caveats, and routing metadata (see Appendix B). These records are appended sequentially to form an end-to-end audit trace suitable for retrospective review, governance, and medicolegal accountability.
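A hand-off record with the fields listed above could be modelled as a dataclass appended to an ordered trace. The field names and types below are assumptions chosen to mirror the prose; the authoritative schema is the one given in Appendix B and the released code.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class HandoffRecord:
    """One module invocation; field names are illustrative assumptions."""
    module: str
    inputs: dict
    outputs: dict
    confidence: float
    caveats: list = field(default_factory=list)
    routing: dict = field(default_factory=dict)

class AuditTrace:
    """Append-only, ordered list of hand-off records."""

    def __init__(self):
        self.records = []

    def append(self, record: HandoffRecord) -> None:
        self.records.append(record)

    def to_json(self) -> str:
        # Serialize in invocation order for retrospective review.
        return json.dumps([asdict(r) for r in self.records], indent=2)
```

Keeping the trace append-only preserves invocation order, so a reviewer can read the serialized records from top to bottom as the sequence of decisions.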

When both medication safety and allergy assessment are indicated, the two modules execute concurrently to reduce latency. Cases requiring only one safety pathway invoke only the relevant module.

Finalised recommendations are written to an in-session FAISS memory store (the Decision Memory in Figure 2) keyed by semantic embeddings of the case state. Clinician feedback, including whether recommendations were accepted, modified, or overridden, may be appended and retrieved in future cases through the case-history recall module.

3.9. Evaluation Protocol

Evaluation was organised into two complementary levels: component-level assessment, which examined the performance of individual modules in isolation, and system-level assessment, which evaluated the behavior of the integrated end-to-end framework.

Across all evaluations, critical-class recall is the pre-specified primary endpoint; overall accuracy, quadratic-weighted $\kappa$, AUC, and the Brier score are supporting metrics. The two LLM-based measures (retrieval relevance grading and the medical-plausibility check) are internal quality indicators and should be interpreted as such rather than as independent validation.
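For reference, the primary endpoint and the quadratic-weighted agreement statistic can be computed as in the sketch below. It uses the standard definitions (recall as true positives over actual positives, and Cohen's kappa with weights $(i-j)^2/(K-1)^2$ over ordered tiers); the function names and the tier ordering are assumptions, and the authors' evaluation code may differ.

```python
TIERS = ["routine", "urgent", "critical"]  # assumed ordinal order

def critical_recall(y_true, y_pred):
    """Fraction of truly critical cases that were predicted critical."""
    preds_for_critical = [p for t, p in zip(y_true, y_pred) if t == "critical"]
    hits = sum(p == "critical" for p in preds_for_critical)
    return hits / len(preds_for_critical)

def quadratic_weighted_kappa(y_true, y_pred, labels=TIERS):
    """Cohen's kappa with quadratic disagreement weights over ordered labels."""
    k = len(labels)
    idx = {label: i for i, label in enumerate(labels)}
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[idx[t]][idx[p]] += 1
    n = len(y_true)
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected
```

Quadratic weighting penalizes a critical case labelled routine four times as heavily as one labelled urgent, which matches the ordinal nature of the tiers.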

3.9.1. System-Level Assessment

Orchestration. Routing decisions and module invocation patterns were examined qualitatively using representative execution cases to verify adaptive pathway selection, correct escalation behavior, and end-to-end traceability.

Triage Classification Performance. The system’s triage performance was evaluated against the real ESI-derived urgency tiers, reporting per-class precision, recall, and F1-scores for critical, urgent, and routine labels, alongside quadratic-weighted agreement, threshold-free discrimination (AUC), calibration (Brier score), and bootstrap confidence intervals.

Baseline Comparison. The full system was compared with two reduced baselines, an LLM-only triage call and a RAG triage call, on the same 1000 cases and with the same underlying language model, so that any difference is attributable to system design rather than model capacity. Each baseline was run both on the narrative alone and with the same structured triage inputs (vitals and arrival mode) that the classifier head receives, ruling out information asymmetry as an explanation. Overall accuracy, critical recall, and paired-bootstrap confidence intervals were reported.

Expert Clinical Adjudication. An experienced physician informaticist reviewed a stratified 20-case subset, blind to the ground-truth label, rating each complete case output (triage tier, differential, disposition, and orders) on a three-level actionability scale: correct (would act as written), passable (would act with modifications), or wrong (would not act). This review assesses whether outputs are clinically actionable, a property that agreement with ESI labels alone cannot capture.

Efficiency and Runtime Performance. End-to-end latency was measured and stratified by routing pathway, together with the orchestrator’s internal dynamics (tool-call count, critique remediation frequency, and token cost). Audit trace structure was reviewed to confirm that routing decisions, evidence sources, and safety outcomes were captured along each evaluated pathway.

3.9.2. Component-Level Assessment

Retrieval Evaluation. Single-pass retrieval was compared with the full iterative similarity search module on four measures: LLM-scored relevance, the proportion of cases above the acceptance threshold (0.5), rewrite frequency, and the number of low-confidence flags.

Drug Safety Validation. The D3 Drug-Drug-Interaction classifier was evaluated against DDInter [46] as an independent reference standard. Performance was assessed by interaction severity tier (major, moderate, and minor), with major-interaction sensitivity as the primary endpoint, given its direct clinical consequence. This is deliberately a cross-source evaluation: D3 is fine-tuned predominantly on DrugBank-derived interaction data, whereas DDInter is a separate, independently curated database, so the validation tests generalization to an external reference standard rather than agreement within a shared ecosystem.

Critique Agent Contribution. The full system was compared with a variant in which the Critique Agent was removed and outputs were passed directly to note compilation. In addition to a side-by-side case analysis, the frequency, tier-targeting, and downstream effect of critique-triggered remediation were quantified across the instrumented continuation-pathway cases.

4. Results

4.1. Orchestration

Figure 3 illustrates three representative execution cases generated by the deployed prototype. Case A shows a critical presentation routed directly to emergency escalation, bypassing non-essential modules. Case B shows a continuation pathway requiring medication safety review only. Case C shows a continuation pathway in which medication and allergy checks are executed concurrently before final recommendation generation.

Across all cases, the Orchestrator Agent selected different module sequences without hardcoded routing rules while maintaining complete audit records. These examples demonstrate adaptive pathway selection, correct escalation behavior, and end-to-end traceability.

4.2. Triage Classification Performance

Table 1 summarizes per-class triage performance on the real-ESI cohort (N = 1000). The system reaches 65.7% overall accuracy (95% confidence interval (CI) 62.9–68.5), corresponding to a quadratic-weighted agreement of κ = 0.49 (95% CI 0.43–0.54) with the nurse-assigned ESI label. For context, human raters themselves agree imperfectly on ESI: reported study-level κ ranges from 0.46 to 0.98, with a pooled estimate near 0.79, and nurses agree with a reference standard at roughly 0.73 [37]. The system therefore reaches the lower end of the human range while observing only a fragmented first-person narrative rather than the full triage encounter.

On the primary safety endpoint, the system recalls 85.0% of critical cases, where critical recall is the proportion of ground-truth critical (ESI-1) cases that the system assigns to the critical tier. The triage head separates critical from non-critical presentations well (AUC 0.92) and its probabilities are well calibrated (Brier 0.09).
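For readers less familiar with these two summary statistics, the sketch below shows one standard way to compute them for a binary critical versus non-critical task, assuming a list of predicted critical-tier probabilities and 1/0 ground-truth labels. The function names are illustrative, and the pairwise AUC formulation is chosen for clarity rather than speed.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 1/0 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def roc_auc(probs, labels):
    """Probability that a random critical case outranks a random non-critical one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case from each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

A lower Brier score indicates better calibrated probabilities, whereas AUC depends only on how the cases are ranked and is therefore unaffected by the choice of decision threshold.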

When the system errs, it usually errs by one tier: 96.3% of predictions fall within one tier of the true label, and the errors skew conservative, with over-triage (22.8%) roughly double under-triage (11.5%). Per class (macro-F1 0.65), critical presentations are identified reliably, the urgent middle tier is the hardest to separate, and routine assignment is conservative.
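The error profile above follows from a simple signed comparison of predicted and true tiers, and the three shares partition the cohort: 65.7% exact, 22.8% over-triage, and 11.5% under-triage sum to 100%. A minimal sketch of that bookkeeping, with an assumed numeric acuity ordering and illustrative names, is:

```python
ACUITY_RANK = {"critical": 0, "urgent": 1, "routine": 2}  # lower rank = more acute


def triage_error_profile(y_true, y_pred):
    """Share of exact, over-triaged, under-triaged, and within-one-tier predictions."""
    n = len(y_true)
    exact = over = under = within_one = 0
    for truth, pred in zip(y_true, y_pred):
        shift = ACUITY_RANK[pred] - ACUITY_RANK[truth]
        if shift == 0:
            exact += 1
        elif shift < 0:
            over += 1   # predicted more acute than the reference label
        else:
            under += 1  # predicted less acute than the reference label
        if abs(shift) <= 1:
            within_one += 1
    return {
        "exact": exact / n,
        "over_triage": over / n,
        "under_triage": under / n,
        "within_one_tier": within_one / n,
    }
```

Note that the within-one-tier share counts exact matches as well as adjacent-tier errors, so it is always at least as large as overall accuracy.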

4.3. Comparison with Alternative System Configurations

To isolate the effect of architecture, we compared TriAgent with two alternative system configurations on the same 1000 cases, holding the language model and retrieval corpus fixed. Standalone LLM assigns a triage tier from a single model call; Traditional RAG augments that call with the five most similar retrieved cases. All systems received identical inputs, the patient narrative together with the structured triage record (vitals and arrival mode), and the two alternatives used a detailed ESI decision-rule prompt. Table 2 reports the results: TriAgent improves overall accuracy by 22 points and critical recall by 70 points over the best alternative. The gap is architectural rather than informational; in a narrative-only condition the alternatives scored only about five points lower, with critical recall essentially unchanged. Lacking the classifier triage head, the agentic orchestration layer, and the confidence-gated emergency routing, both alternatives default to the lower-acuity majority tiers and miss nearly all life-threatening presentations.

4.4. Expert Clinical Evaluation

Automated agreement with ESI labels captures triage calibration but not whether a clinician would act on the system’s full output. To complement the automated metrics with expert judgment, an experienced physician informaticist independently reviewed a stratified 20-case subset (7 critical, 7 urgent, 6 routine) drawn from the same real-ESI cohort (single reviewer; formative intent), blind to the ground-truth labels, rating each complete case output on the three-level actionability scale defined in Section 3.9. This evaluation was conducted on an earlier iteration of the system and was deliberately formative: its purpose was to surface failure modes that automated tier-accuracy could not.

The evaluation yielded encouraging signals alongside clear targets for improvement. Twelve of 20 outputs (60%) were rated clinically acceptable (correct or passable), with 5 (25%) actionable as written; the reviewer highlighted conservative over-triage and broad differential awareness as genuine strengths, summarizing the system as “a great beginning.” All triage errors were confined to an adjacent tier. The evaluation also pinpointed two specific, correctable failure modes: anchoring on the presenting complaint, which collapsed predictions toward the modal urgent tier (13/20) and limited critical-case sensitivity to 3/7, and a “clinical passivity” pattern in which 11/20 dispositions deferred to hold pending clarification rather than committing to a worst-case-driven plan.

These expert-identified failure modes directly motivated the safety-oriented refinements in the present system: explicit worst-case differential enumeration to counter chief-complaint anchoring, and dynamic conditional (if/then) orders to replace passive holds. The substantially higher critical recall reported above (85.0% versus 3/7 on the earlier iteration) is a direct outcome of these changes. To close the loop, the same 20-case subset was re-evaluated on the present system: all 20 outputs (100%) were now rated clinically acceptable, with no wrong ratings, no passive holds, and no flagged medication issues. We report the formative evaluation in full because it is itself the evidence that the system was assessed against human clinical judgment, not merely against recovered label rules. Given its limited scale, these actionability ratings should be read as directional signals rather than definitive performance estimates; larger, multi-reviewer adjudication is planned as future work.

4.5. Mitigation Effects at Scale

We verified that the three mitigations (Section 3.7) behave as intended across the full 1000-case cohort. The worst-case differential pass produced a non-empty rule-out list in 96.0% of cases, confirming that the anti-anchoring mechanism engages near-universally. The elimination of passive holds was complete: 0/1000 cases terminated in a “hold pending clarification” disposition, compared with 11/20 in the earlier iteration assessed in the expert evaluation (Section 4.4); every case instead reached an actionable disposition (778 approve-with-modification, 216 emergency-bypass, 5 approve-as-written, 1 reject; no missing dispositions). The deterministic medication guard fired selectively rather than indiscriminately, excluding at least one proposed action in 13.1% of cases (131/1000), distributed across the urgent (55), routine (45), and critical (31) tiers in proportion to where risky medications were actually proposed. These engagement patterns translate directly into the outcome gains reported above: critical recall rose from 3/7 on the earlier iteration to 85.0% at scale. Together they confirm that the mitigations operate at population scale with the intended targeting, not merely on the hand-picked cases that motivated them.

4.6. Efficiency and Runtime Performance

Latency and agent dynamics were measured over instrumented runs on the full 1000-case cohort. Adaptive routing cleanly separates the two pathways (Table 3): emergency bypass (21.6% of cases) skips retrieval, planning, safety verification, and critique and completes in a median of 64 s, while the full continuation pathway takes a median of 186 s (p90 268 s). Continuation latency is dominated by a fixed reasoning-decode base of ≈105 s, with each tool invocation adding ≈12 s; a critique remediation round-trip, triggered in 38.2% of instrumented continuation cases (Section 4.10), adds a further ∼15 s. The orchestrator issues a median of 7 tool calls per case at an average cost of ∼54 k tokens, and every case yields both a triage decision and a safety disposition. Since the bypass pathway governs response time on the presentations where speed matters most, deployment-relevant latency is roughly a third of the continuation figure.
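The latency components quoted above can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that the components add linearly, which is an interpretation of the reported figures rather than a fitted model, and all names are illustrative.

```python
# Approximate components reported in this section, in seconds.
BASE_DECODE_S = 105.0      # fixed reasoning-decode base
PER_TOOL_CALL_S = 12.0     # added by each tool invocation
PER_REMEDIATION_S = 15.0   # added by a critique remediation round-trip


def estimate_continuation_latency(tool_calls, remediation_rounds=0):
    """Additive estimate of continuation-pathway latency in seconds."""
    return (BASE_DECODE_S
            + PER_TOOL_CALL_S * tool_calls
            + PER_REMEDIATION_S * remediation_rounds)


median_case = estimate_continuation_latency(tool_calls=7)    # 105 + 7 * 12 = 189.0
with_remediation = estimate_continuation_latency(7, 1)       # 189 + 15 = 204.0
bypass_share = 64 / 186                                      # about 0.34 of the continuation median
```

With the median of 7 tool calls, the additive estimate of 189 s lies close to the measured continuation median of 186 s, and the ratio of the bypass median to the continuation median (64/186) is about one third.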

4.7. Reproducibility, Prompt Robustness, and Backbone Sensitivity

Because the orchestrator is language-model-driven, we measured how stable its outputs are under repetition and under a change of backbone model, in both cases on the full 1000-case cohort.

4.7.1. Run-to-Run Variability

We ran the full pipeline three times. The safety-critical emergency-bypass decision was the most stable component, identical across all three runs in 93.3% of cases, and the final disposition was identical in 80%. The finer exact-tier assignment was less stable (identical across all three runs in 50% of cases): the otherwise deterministic classifier head is fed a chief complaint inferred by the language model, whose wording varies between runs, and the resulting tier changes were confined to adjacent tiers. Per-run overall accuracy averaged 64.4% with a standard deviation of 5.7 points across the three runs, while the number of orchestrator tool calls per case varied little (mean per-case standard deviation 0.6). The system is therefore highly reproducible at the level of the safety-critical routing decision and only moderately reproducible at the level of the exact tier.
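The stability figures above are shares of cases whose output is identical in every run. A minimal sketch of that computation follows, assuming one list of per-case outputs per run in a fixed case order; the use of the sample standard deviation for the accuracy spread is also an assumption, since the estimator is not specified here.

```python
from statistics import mean, stdev


def fraction_identical(runs):
    """Share of cases whose value is the same in every run.

    runs is a list of equal-length lists, one per pipeline run, aligned by case.
    """
    n_cases = len(runs[0])
    if any(len(run) != n_cases for run in runs):
        raise ValueError("all runs must cover the same cases")
    stable = sum(1 for i in range(n_cases) if len({run[i] for run in runs}) == 1)
    return stable / n_cases


def accuracy_spread(per_run_accuracy):
    """Mean and sample standard deviation of per-run accuracy."""
    return mean(per_run_accuracy), stdev(per_run_accuracy)
```

The same function applies to any per-case output, so the bypass decision, the final disposition, and the exact tier can each be passed in turn to obtain the three stability levels.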

4.7.2. Backbone Sensitivity

To test whether the findings depend on the specific backbone, we re-ran the pipeline with an independent model from a different family (Claude Sonnet 4.6) substituted for GPT-5-nano, leaving the architecture unchanged. The two backbones produced near-identical overall accuracy and critical recall, agreed on the triage tier in 92.3% of cases, and the alternative backbone returned a complete disposition on every case, indicating that the results reflect the system design rather than an idiosyncrasy of one model.

4.7.3. Prompt Robustness

A further source of variability not captured by the run-to-run analysis is prompt wording. The structured prompts used for each module were finalized on a held-out development set and held constant across all reported runs; minor rewording of the orchestrator or critique prompts in pilot experiments produced comparable routing behavior, indicating robustness to minor prompt variations. A systematic prompt ablation is outside the scope of the present work and is left to future study.

4.8. Retrieval Evaluation

We compared the iterative retrieval module against a single-pass search strategy, as shown in Table 4. Iterative retrieval markedly improved mean relevance score and the proportion of cases exceeding the acceptance threshold. A total of 312 cases required at least one rewrite, most of which were successfully resolved within the first refinement cycle. Only a small minority (33 cases, 3.3%) remained below threshold and were automatically flagged as low-confidence, preserving transparency for downstream decision-making.
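The retrieve, grade, and rewrite cycle evaluated here can be sketched as a small loop. The callables `search`, `grade`, and `rewrite` are hypothetical stand-ins for the similarity search, the LLM relevance grader, and the query rewriter; the cap of two rewrites and the choice to grade against the original query are assumptions of this sketch, while the 0.5 threshold is the acceptance threshold stated in Section 3.9.2.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalResult:
    query: str              # query that produced the best-scoring documents
    documents: List[str]
    relevance: float        # grader score in [0, 1]
    rewrites: int           # number of rewrite cycles performed
    low_confidence: bool    # True if the best score stayed below threshold


def iterative_retrieve(
    query: str,
    search: Callable[[str], List[str]],
    grade: Callable[[str, List[str]], float],
    rewrite: Callable[[str, List[str]], str],
    threshold: float = 0.5,
    max_rewrites: int = 2,   # assumed cap; not specified in the text
) -> RetrievalResult:
    """Retrieve, grade, and rewrite the query until relevance reaches the threshold."""
    best_query, best_docs = query, search(query)
    best_score = grade(query, best_docs)
    current, rewrites = query, 0
    while best_score < threshold and rewrites < max_rewrites:
        current = rewrite(current, best_docs)
        rewrites += 1
        docs = search(current)
        score = grade(query, docs)  # always graded against the original question
        if score > best_score:
            best_query, best_docs, best_score = current, docs, score
    return RetrievalResult(best_query, best_docs, best_score, rewrites,
                           low_confidence=best_score < threshold)
```

Returning the low-confidence flag alongside the documents, instead of raising an error, mirrors the behavior described above in which below-threshold cases are passed downstream with an explicit flag.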

4.9. Drug Safety Validation

Table 5 presents validation of the D3 Drug-Drug-Interaction classifier against DDInter. Performance was strongest for major interactions, the tier with the greatest potential clinical consequence, where sensitivity exceeded 96%. Performance declined modestly for moderate and minor interactions, while specificity remained high across all tiers.

The overall false positive rate was low, consistent with a caution-alert decision-support setting in which suspicious combinations are surfaced for review rather than automatically blocked. As noted in Section 3.9, this is a cross-source evaluation: D3 is fine-tuned predominantly on DrugBank-derived interaction data, while DDInter is an independently curated reference, so these figures reflect generalization to an external standard rather than agreement within a shared ecosystem.

4.10. Critique Agent Contribution

To isolate the contribution of the Critique Agent, the full system was compared with a variant in which critique was removed and outputs passed directly to note compilation. Figure 4 shows a representative case involving medication uncertainty and a conflicting allergy profile. With critique enabled, failed consistency checks triggered a remediation pass and added a pharmacist review flag; without it, the same uncertainty remained implicit in the reasoning trace and never surfaced in the final recommendation. Critique rarely altered straightforward cases, but it consistently improved outputs containing incomplete safety checks, conflicting confidence signals, or latent pharmacological inconsistencies.

To move beyond a single illustrative case, we quantified how often, and how consequentially, critique intervened across the instrumented continuation cases, that is, the continuation cases executed with detailed critique logging enabled (n = 395 of 784; emergency-bypass cases skip critique by design). Three findings emerge. First, critique engages frequently: at least one remediation round-trip was initiated in 38.2% of continuation cases. Second, it targets the cases that matter most: remediation was triggered in 48.4% of cases assigned to the critical tier, versus 34.6% of urgent and 35.5% of routine cases, so the loop engages disproportionately on the most consequential presentations rather than uniformly. Third, it has a measurable downstream effect: remediated cases were roughly twice as likely to end with at least one candidate action explicitly excluded on safety grounds (20% versus 11% for non-remediated cases). The qualitative pattern in Figure 4, critique surfacing and pruning latent safety concerns, is thus established at the population level. A full pre-versus-post recommendation diff would require logging the intermediate plan state around each remediation pass and is left to future instrumentation; the present measurements establish frequency, targeting, and downstream effect.

5. Discussion

Taken together, the experiments support one conclusion: in agentic clinical decision support, performance is governed as much by what information reaches each decision stage, and how it is routed through the workflow, as by the quality of the reasoning model itself. The iterative retrieval module raised evidence relevance well above a single-pass strategy, and the embedded medication and allergy screens ensured safety coverage on every continuation pathway. Future gains may therefore come as much from better retrieval resources and structured information flow as from larger reasoning models.

Adaptive orchestration is what delivers this information economy in practice. Presentations assigned the critical tier were recommended for immediate escalation, with the orchestrator routing them to the emergency pathway, while non-critical cases invoked only the modules their context required: medication-only cases bypassed allergy analysis, and cases with both medications and allergies triggered the two safety checks in parallel. A deterministic pipeline, which applies the same sequence of steps to every case regardless of complexity, offers neither the economy nor the adaptivity.
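The routing pattern described above can be illustrated with a short sketch. In TriAgent these choices are made by the Orchestrator Agent through model reasoning rather than coded rules, so the explicit conditions below are only a stand-in for the observed behavior; the function names, the case dictionary fields (`tier`, `medications`, `allergies`), and the check outputs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def check_medication_conflicts(case):
    """Hypothetical stand-in for the medication-conflict screen."""
    return {"check": "medication", "items": list(case.get("medications", []))}


def check_allergy_risk(case):
    """Hypothetical stand-in for the allergy-risk assessment."""
    return {"check": "allergy", "items": list(case.get("allergies", []))}


def required_safety_checks(case):
    """Select only the checks that the case context calls for."""
    checks = []
    if case.get("medications"):
        checks.append(check_medication_conflicts)
    if case.get("allergies"):
        checks.append(check_allergy_risk)
    return checks


def run_safety_checks(case):
    """Run the selected checks concurrently; critical cases escalate instead."""
    if case.get("tier") == "critical":
        return []  # emergency pathway: continuation modules are bypassed
    checks = required_safety_checks(case)
    if not checks:
        return []
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check, case) for check in checks]
        return [future.result() for future in futures]  # results in submission order
```

A medication-only case therefore triggers a single check, while a case with both medications and allergies runs the two checks side by side. Policy review and critique are omitted from this sketch.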

Safety rested on two complementary mechanisms. Medication conflict screening, allergy-risk assessment, and policy review were embedded inside the execution pathway rather than appended as optional downstream checks, and the Critique Agent re-examined completed outputs, triggering remediation whenever they were incomplete, weakly supported, or internally inconsistent. We are careful about what this does and does not demonstrate. Throughout this paper, “safety” refers to structural system properties. The gains we measure are internal system properties: more complete safety-check coverage, remediation of inconsistent or weakly supported outputs, and structured provenance. They are not, by themselves, evidence of improved real-world clinical safety; embedded checks make unsafe omissions less likely and more auditable, but a clinical-safety benefit can only be confirmed through prospective evaluation with clinicians in the loop.

The same structural discipline is what makes the system auditable. Every module invocation generated a hand-off record of routing decisions, evidence sources, outputs, and confidence metadata, yielding complete pathway provenance for every evaluated case. This supports retrospective review, quality assurance, and governance, and remains difficult to achieve in loosely coupled retrieval pipelines, where intermediate decisions are often opaque or incompletely logged.

The principal trade-off for these properties is runtime. Emergency escalation completed in a median of 64 s by short-circuiting planning, retrieval, and verification, whereas the full continuation pathway, with retrieval, treatment planning, safety review, and critique, took a median of 186 s. The continuation figure is substantial for acute care, and we do not claim it is ready to serve as a blocking step; the design intent is that the fast bypass pathway governs the time-critical presentations while the slower deliberative pathway serves non-emergent cases, where a few minutes of added latency is clinically tolerable and is repaid in safety coverage and auditability. Because the latency is dominated by reasoning-model decode rather than tool execution (Section 4.6), it would also fall substantially with a faster backbone or hardware. Realistic integration would therefore position the system as an asynchronous second reader that surfaces a structured, audited recommendation shortly after triage, not as a synchronous bedside oracle.

Several limitations should be acknowledged. First, the benchmark dataset was derived from retrospective MIMIC-IV discharge notes rather than true point-of-care intake data. We therefore characterize this work as a retrospective evaluation on real clinical data under synthesized incomplete-information inputs, intended to evaluate orchestration, routing, and safety-verification behavior rather than to demonstrate real-world clinical performance: although the ground-truth acuity comes from real triage decisions, the first-person narratives are reformulated from structured discharge documentation and cannot reproduce the noise, missing information, and temporal evolution of live presentations.

Second, retrieval performance remains dependent on corpus coverage, particularly for rare or underrepresented scenarios. Third, LLM-based orchestration introduces some run-to-run variability, quantified in Section 4.7: the emergency-bypass decision is highly stable across repeated runs, but the finer exact-tier assignment varies in a non-trivial fraction of cases, because the otherwise deterministic classifier head consumes a chief complaint that is itself inferred by the language model.

Fourth, two secondary evaluation components, retrieval relevance grading and the medical-plausibility check, rely on LLM-based scoring. Neither produces the headline results, which are anchored to external references (clinician-assigned ESI acuity, the independently curated DDInter database, and blinded expert review), and the plausibility check uses a model family different from the orchestration backbone so that no component evaluates its own outputs; comprehensive clinician annotation across a substantially larger cohort nevertheless remains planned future work. Fifth, prompt robustness has so far been assessed only informally: all prompts were held constant across the reported runs, and minor rewording in pilot experiments produced comparable routing behavior (Section 4.7), but a systematic ablation of prompt wording was not conducted, and such an ablation is needed before contributions can be attributed to architecture rather than to prompt design. Finally, the framework has not yet been prospectively validated with clinicians in live operational settings; prospective point-of-care validation, including establishing acceptable response times against clinician decision time, is a prerequisite for any deployment claim.

Future work will focus on expanding retrieval resources to include emergency and pediatric datasets, improving terminology normalization for medications and supplements, calibrating policy thresholds using clinician feedback, and conducting prospective studies in real clinical environments. These steps are necessary to determine whether the gains observed in retrospective benchmarking translate into safer and more efficient bedside decision support.

6. Conclusions

We presented an agentic framework for crisis clinical decision support under incomplete information that integrates adaptive orchestration, specialized safety modules, iterative retrieval, structured critique, and full audit tracing within a unified workflow. Across evaluation experiments, the system demonstrated improved retrieval quality, robust safety verification, and efficient emergency escalation while maintaining complete pathway provenance. These findings suggest that clinically useful AI decision support may depend less on any single modeling advance than on how reasoning, specialized tools, mandatory verification, and transparency are coordinated within a single architecture. Because the evaluation uses retrospective records, with synthesized rather than live patient narratives as inputs, these results characterize the system’s internal behavior rather than its clinical effectiveness; establishing the latter requires prospective validation with clinicians in the loop. Within that scope, however, the present results support agentic orchestration as a promising direction for trustworthy AI-assisted crisis care.

7. Declaration of Generative AI and AI-Assisted Technologies in the Manuscript Preparation Process

During manuscript preparation, Claude Sonnet 4.6 was used for editorial refinement of language. All content was reviewed and verified by the authors, who take full responsibility for the final version.

Author Contributions

Conceptualization, A.I. and A.S.; methodology, A.I. and A.S.; software, A.I. and A.S.; validation, A.I., A.A. and A.S.; formal analysis, A.I. and A.S.; investigation, A.I. and A.S.; resources, A.S.; data curation, A.I. and A.S.; writing—original draft preparation, A.I. and A.S.; writing—review and editing, A.I., A.A. and A.S.; visualization, A.I. and A.S.; supervision, A.S.; project administration, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The evaluation dataset was derived from MIMIC-IV-Note: Deidentified Free-Text Clinical Notes (v2.2) [35], available at https://physionet.org/content/mimic-iv-note/2.2/ (accessed on 12 June 2026). The triage classifier was trained on MIMIC-IV-ED (v2.2) [36], available at https://physionet.org/content/mimic-iv-ed/2.2/ (accessed on 12 June 2026). Access to both datasets was obtained through PhysioNet in accordance with the applicable data use agreements and credentialing requirements.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
CDS	Clinical Decision Support
CDSS	Clinical Decision Support System
D3	Drug-Drug-Interaction classifier
DDI	Drug-Drug Interaction
EHR	Electronic Health Record
FAISS	Facebook AI Similarity Search
LLM	Large Language Model
MIMIC	Medical Information Mart for Intensive Care
NLP	Natural Language Processing
RAG	Retrieval-Augmented Generation
${SpO}_{2}$	Peripheral oxygen saturation

Appendix A. Evaluation Cohort Composition

The 1000-case evaluation cohort was drawn from MIMIC-IV-ED with ground-truth urgency taken directly from the real nurse-assigned Emergency Severity Index (ESI) acuity recorded at triage, rather than from any deterministic rule applied to the note text. The three urgency tiers map onto ESI acuity as critical (=ESI-1), urgent (=ESI-2), and routine (≥ESI-3), and the cohort is balanced across tiers by construction (Table A1). Each case carries the real discharge narrative, the measured triage vitals, the ED chief complaint, primary diagnosis, and disposition; the retrieval corpus of 49,000 discharge notes is patient-disjoint from these cases (verified zero subject overlap).

Table A1. Composition of the 1000-case evaluation cohort by ground-truth ESI acuity and mapped urgency tier. Acuity is the real triage-assigned ESI from MIMIC-IV-ED.

ESI Acuity	Urgency Tier	Cases
ESI-1 (resuscitation)	Critical	334
ESI-2 (emergent)	Urgent	333
ESI-3 (urgent)	Routine	328
ESI-4 (less urgent)	Routine	5
Total		1000
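The ESI-to-tier mapping and the balance of the cohort can be checked with a few lines of Python. The counts are copied from Table A1; the function and constant names are illustrative.

```python
from collections import Counter


def esi_to_tier(esi_level):
    """Map an ESI acuity level (1 to 5) onto the three urgency tiers."""
    if esi_level == 1:
        return "critical"
    if esi_level == 2:
        return "urgent"
    if esi_level in (3, 4, 5):
        return "routine"
    raise ValueError(f"unexpected ESI level: {esi_level}")


TABLE_A1_COUNTS = {1: 334, 2: 333, 3: 328, 4: 5}  # ESI level -> cases, from Table A1


def tier_counts(esi_counts):
    """Aggregate case counts per urgency tier."""
    totals = Counter()
    for esi_level, count in esi_counts.items():
        totals[esi_to_tier(esi_level)] += count
    return dict(totals)
```

Applied to Table A1, the routine tier combines the 328 ESI-3 and 5 ESI-4 cases into 333, giving 334, 333, and 333 cases across the critical, urgent, and routine tiers.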

The most frequent chief complaints in the cohort are abdominal pain (n = 95), dyspnea (58), fall (49), chest pain (42), and fever (35), reflecting a representative emergency-department case mix. Because the evaluation narratives were derived from discharge notes, the cohort is skewed toward admitted patients (96.5% ADMITTED); we therefore treat ED disposition only as a descriptive field and exclude it as a model feature, since it is both post-triage and degenerate on this cohort.

Appendix B. Structured Audit Trace Schema

To ensure end-to-end provenance and reproducibility, every module invocation in the proposed framework generates a structured hand-off record that is appended sequentially to the audit trace. This mechanism guarantees that routing decisions, intermediate outputs, safety checks, and final recommendations remain inspectable after execution. Table A2 summarizes the fields contained in each hand-off record.

Table A2. Structured hand-off schema used to construct the audit trace.

Field	Description
module_name	Name of the invoked module or agent.
input_received	Structured inputs passed to the module at invocation.
reasoning_summary	Concise summary of the rationale or decision basis.
output	Structured result returned by the module.
confidence	Self-reported confidence score in the range $[0, 1]$.
caveats	Missing information, warnings, or uncertainty notes.
next_step_hint	Optional recommendation for subsequent module calls.
timestamp	Completion time of the module invocation.

The audit trace is cumulative across the full pathway. Modules that are intentionally bypassed return predefined sentinel outputs so that omitted steps are also represented explicitly in the final trace.
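One possible representation of the hand-off record in Table A2 is a small data class appended to a list that serves as the audit trace. The field names follow Table A2; the types, the ISO-8601 timestamp, and the sentinel contents used for bypassed modules (a `bypassed` status with confidence 1.0) are assumptions of this sketch, since the predefined sentinel values are not specified here.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class HandoffRecord:
    module_name: str
    input_received: Dict[str, Any]
    reasoning_summary: str
    output: Dict[str, Any]
    confidence: float
    caveats: List[str] = field(default_factory=list)
    next_step_hint: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Table A2 restricts confidence to the closed interval [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def bypassed_record(module_name, reason):
    """Sentinel record for an intentionally skipped module (assumed layout)."""
    return HandoffRecord(
        module_name=module_name,
        input_received={},
        reasoning_summary=reason,
        output={"status": "bypassed"},
        confidence=1.0,  # assumed: the bypass itself is a deliberate, certain outcome
        caveats=["module intentionally not executed"],
    )


audit_trace: List[HandoffRecord] = []
audit_trace.append(HandoffRecord(
    "triage_classifier", {"narrative": "example"}, "illustrative entry",
    {"tier": "critical"}, 0.93,
))
audit_trace.append(bypassed_record("allergy_risk", "emergency bypass"))
```

Each record can be serialized with `asdict` for storage, and because skipped modules also contribute a record, the trace lists every module on the pathway whether or not it executed.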

References

Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
Johnson, K.B.; Wei, W.-Q.; Weeraratne, D.; Frisse, M.E.; Misulis, K.; Rhee, K.; Zhao, J.; Snowdon, J.L. Precision medicine, AI, and the future of personalized health care. Clin. Transl. Sci. 2021, 14, 86–93.
Miotto, R.; Wang, F.; Wang, S.; Jiang, X.; Dudley, J.T. Deep learning for healthcare: Review, opportunities and challenges. Brief. Bioinform. 2018, 19, 1236–1246.
Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. N. Engl. J. Med. 2019, 380, 1347–1358.
Adler-Milstein, J.; DesRoches, C.M.; Kralovec, P.; Foster, G.; Worzala, C.; Charles, D.; Searcy, T.; Jha, A.K. Electronic health record adoption in US hospitals: Progress continues, but challenges persist. Health Aff. 2015, 34, 2174–2180.
Kruse, C.S.; Kristof, C.; Jones, B.; Mitchell, E.; Martinez, A. Barriers to electronic health record adoption: A systematic literature review. J. Med. Syst. 2016, 40, 252.
Croskerry, P. A universal model of diagnostic reasoning. Acad. Med. 2009, 84, 1022–1028.
Berner, E.S.; Graber, M.L. Overconfidence as a cause of diagnostic error in medicine. Am. J. Med. 2008, 121, S2–S23.
London, A.J. Artificial intelligence and black-box medical decisions: Accuracy versus explainability. Hastings Cent. Rep. 2019, 49, 15–21.
Rajkomar, A.; Hardt, M.; Howell, M.D.; Corrado, G.; Chin, M.H. Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 2018, 169, 866–872.
Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940.
Ibrahim, A.; Hosseini, A.; Helmy, H.; Arabi, M.; AlShareef, A.; Lakhdhar, W.; Serag, A. MENARA: Medical natural Arabic response assistant. Mach. Learn. Knowl. Extr. 2026, 8, 110.
Alkaissi, H.; McFarlane, S.I. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus 2023, 15, e35179.
Lee, P.; Bubeck, S.; Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 2023, 388, 1233–1239.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The rise and potential of large language model based agents: A survey. arXiv 2023, arXiv:2309.07864.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv 2022, arXiv:2210.03629.
Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997.
Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652.
Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 46534–46594.
Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An overview of clinical decision support systems: Benefits, risks, and strategies for success. npj Digit. Med. 2020, 3, 17.
Kawamoto, K.; Houlihan, C.A.; Balas, E.A.; Lobach, D.F. Improving clinical practice using clinical decision support systems: A systematic review of trials to identify features critical to success. BMJ 2005, 330, 765.
Helmy, H.; Ben Rabah, C.; Ali, N.; Ibrahim, A.; Hoseiny, A.; Serag, A. Optimizing ICU readmission prediction: A comparative evaluation of AI tools. In Applications of Medical Artificial Intelligence; Springer: Cham, Switzerland, 2025; pp. 95–104.
Friedman, C.; Rindflesch, T.C.; Corn, M. Natural language processing: State of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. J. Biomed. Inform. 2013, 46, 765–773.
Wang, Y.; Wang, L.; Rastegar-Mojarad, M.; Moon, S.; Shen, F.; Afzal, N.; Liu, S.; Zeng, Y.; Mehrabi, S.; Sohn, S.; et al. Clinical information extraction applications: A literature review. J. Biomed. Inform. 2018, 77, 34–49.
Ibrahim, A.; Khalili, A.; Arabi, M.; Sattar, A.; Hosseini, A.; Serag, A. MERA: Medical electronic records assistant. Mach. Learn. Knowl. Extr. 2025, 7, 73.
Lakhdhar, W.; Arabi, M.; Ibrahim, A.; Arabi, A.; Serag, A. ChatCVD: A retrieval-augmented chatbot for personalized cardiovascular risk assessment with a comparison of medical-specific and general-purpose LLMs. AI 2025, 6, 163.
Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar] [CrossRef]
Tang, X.; Zou, A.; Zhang, Z.; Li, Z.; Zhao, Y.; Zhang, X.; Cohan, A.; Gerstein, M. MedAgents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 599–621. [Google Scholar]
Ibrahim, A.; Hosseini, A.; Ibrahim, S.; Sattar, A.; Serag, A. D3: A small language model for drug-drug interaction prediction and comparison with large language models. Mach. Learn. Appl. 2025, 20, 100658. [Google Scholar] [CrossRef]
Johnson, A.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV-Note: Deidentified Free-Text Clinical Notes. PhysioNet. 2023. Available online: https://physionet.org/content/mimic-iv-note/2.2/ (accessed on 12 June 2026).
Johnson, A.; Bulgarelli, L.; Pollard, T.; Celi, L.A.; Mark, R.; Horng, S. MIMIC-IV-ED. PhysioNet. 2023. Available online: https://physionet.org/content/mimic-iv-ed/2.2/ (accessed on 12 June 2026).
Mirhaghi, A.; Heydari, A.; Mazlom, R.; Hasanzadeh, F. Reliability of the Emergency Severity Index: Meta-analysis. Sultan Qaboos Univ. Med. J. 2015, 15, e71–e77.
Hosseini, A.; Serag, A. Is synthetic data generation effective in maintaining clinical biomarkers? Investigating diffusion models across diverse imaging modalities. Front. Artif. Intell. 2025, 7, 1454441.
OpenAI. GPT-5 Models Documentation. OpenAI Platform. 2025. Available online: https://platform.openai.com/docs/models (accessed on 2 June 2026).
Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2021, 7, 535–547.
Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T. Agentic retrieval-augmented generation: A survey on agentic RAG. arXiv 2025, arXiv:2501.09136.
U.S. Food and Drug Administration. FDA Adverse Event Reporting System (FAERS) Public Dashboard. 2023. Available online: https://www.fda.gov/drugs/fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-public-dashboard (accessed on 1 April 2025).
Sackett, D.L.; Rosenberg, W.M.C.; Gray, J.A.M.; Haynes, R.B.; Richardson, W.S. Evidence based medicine: What it is and what it isn’t. BMJ 1996, 312, 71–72.
Kawakami, W.; Suzuki, K.; Iwasawa, J. Stabilizing reasoning in medical LLMs with continued pretraining and reasoning preference optimization. arXiv 2025, arXiv:2504.18080.
Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv 2024, arXiv:2412.18925.
Xiong, G.; Yang, Z.; Yi, J.; Wang, N.; Wang, L.; Zhu, H.; Wu, C.; Lu, A.; Chen, X.; Liu, S.; et al. DDInter: An online drug–drug interaction database towards improving clinical decision-making and patient safety. Nucleic Acids Res. 2022, 50, D1200–D1207.

Figure 1. Composition of the evaluation cohort (N = 1000) as a Sankey diagram, flowing from ground-truth ESI tier through designated routing pathway to documented safety profile; ribbon widths are proportional to case counts. Tiers are balanced by construction, and non-emergency cases are stratified near-uniformly across the four medication/allergy combinations to exercise every safety pathway.

Figure 2. End-to-end architecture of the agentic TriAgent system. The Orchestrator Agent dynamically invokes composable specialist modules for urgency assessment, similar-case retrieval, treatment drafting, medication safety, allergy risk assessment, policy review, documentation, and case-history recall. Critical presentations activate an emergency bypass pathway. Non-critical outputs pass through a downstream Critique Agent before finalization. Dashed arrows indicate the clinician feedback loop through Decision Memory.

Figure 3. Representative execution cases produced by adaptive orchestration. (A) Critical case with immediate escalation. (B) Medication-safety pathway only. (C) Parallel medication and allergy safety pathway. Each recommended action carries a two-axis tag (DDI and allergy). A value of cleared means screened with nothing flagged; caution flagged means an interaction or cross-reactivity was identified and managed, with the action still recommended; n/a means no medications or allergies are on record.
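The two-axis tag described in the Figure 3 caption can be pictured as a small data structure. The sketch below is illustrative only: the class names `SafetyStatus` and `RecommendedAction`, the `tag` method, its bracketed output format, and the example action text are all hypothetical and are not taken from the TriAgent implementation. Only the three status values and the two axes (DDI and allergy) come from the caption.

```python
from dataclasses import dataclass
from enum import Enum


class SafetyStatus(Enum):
    """The three tag values named in the Figure 3 caption."""
    CLEARED = "cleared"                  # screened, nothing flagged
    CAUTION_FLAGGED = "caution flagged"  # issue identified and managed
    NOT_APPLICABLE = "n/a"               # no medications/allergies on record


@dataclass(frozen=True)
class RecommendedAction:
    """Hypothetical container: one action plus its DDI and allergy axes."""
    description: str
    ddi: SafetyStatus
    allergy: SafetyStatus

    def tag(self) -> str:
        # Render both axes side by side (format is an assumption).
        return f"[DDI: {self.ddi.value} | allergy: {self.allergy.value}]"


# Placeholder action text, not a clinical recommendation from the paper.
action = RecommendedAction(
    description="example action",
    ddi=SafetyStatus.CAUTION_FLAGGED,
    allergy=SafetyStatus.NOT_APPLICABLE,
)
print(action.description, action.tag())
```

Keeping the two axes as separate fields mirrors the caption's point that a flagged caution on one axis does not withdraw the action; it only annotates it.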

Figure 4. Representative Critique Agent ablation. (A) With critique enabled, remediation and pharmacist review are triggered. (B) Without critique, uncertainty is not escalated into the final recommendation.

Table 1. Per-class triage classification performance on the real-ESI cohort (N = 1000).

Class	Precision	Recall	F1
Critical	0.72	0.85	0.78
Urgent	0.54	0.55	0.54
Routine	0.72	0.57	0.64
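As a consistency check on Table 1, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses only the rounded values printed in the table, so it is an approximation: the authors presumably computed F1 from unrounded counts, and agreement to two decimals is not guaranteed in general. The function name `f1_score` and the dictionary layout are illustrative choices.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as printed in Table 1.
table1 = {
    "Critical": (0.72, 0.85),
    "Urgent": (0.54, 0.55),
    "Routine": (0.72, 0.57),
}

# Recompute F1 per class and round to the table's two decimals.
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r) in table1.items()}
for name, f1 in recomputed.items():
    print(f"{name}: F1 = {f1:.2f}")
```

With these inputs the recomputed values agree with the F1 column of the table.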

Table 2. Triage comparison across system configurations on the same 1000 cases (same language model, real ESI labels). All systems receive the patient narrative plus the matched structured triage inputs (vitals + arrival mode). 95% confidence intervals from a paired bootstrap (2000 resamples); critical recall is the fraction of ground-truth ESI-1 cases assigned to the critical tier. Bold values indicate the best-performing configuration in each column.

System	Overall Accuracy	95% CI	Critical Recall
Standalone LLM	43.4%	[40.3, 46.5]	14.7%
Traditional RAG	42.2%	[39.2, 45.3]	14.7%
TriAgent (Agentic AI)	65.7%	[62.9, 68.5]	85.0%
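The Table 2 caption mentions a paired bootstrap with 2000 resamples and defines critical recall. A minimal sketch of both follows, under stated assumptions: the percentile method is assumed (the caption does not say which bootstrap interval was used), the function names are hypothetical, and the per-case correctness flags are synthetic, built only to match the reported overall accuracies. The printed intervals are therefore not expected to reproduce the table exactly.

```python
import random


def paired_bootstrap_ci(correct_by_system, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs for accuracy over a shared set of cases.

    correct_by_system maps a system name to a list of 0/1 flags, one per
    case, in the same case order for every system. Each resample draws one
    set of case indices and applies it to all systems (the pairing).
    """
    rng = random.Random(seed)
    names = list(correct_by_system)
    n_cases = len(correct_by_system[names[0]])
    samples = {name: [] for name in names}
    for _ in range(n_resamples):
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        for name in names:
            flags = correct_by_system[name]
            samples[name].append(sum(flags[i] for i in idx) / n_cases)
    lo_pos = int((alpha / 2) * n_resamples)
    hi_pos = int((1 - alpha / 2) * n_resamples) - 1
    return {
        name: (sorted(vals)[lo_pos], sorted(vals)[hi_pos])
        for name, vals in samples.items()
    }


def critical_recall(true_tiers, predicted_tiers, critical="critical"):
    """Fraction of ground-truth critical cases assigned to the critical tier."""
    hits = [p == critical for t, p in zip(true_tiers, predicted_tiers) if t == critical]
    return sum(hits) / len(hits)


# SYNTHETIC per-case outcomes: only the totals (434 and 657 of 1000)
# come from Table 2; which cases are correct is invented here.
_rng = random.Random(42)


def synthetic_flags(n_correct, n_cases=1000):
    flags = [1] * n_correct + [0] * (n_cases - n_correct)
    _rng.shuffle(flags)
    return flags


outcomes = {
    "Standalone LLM": synthetic_flags(434),
    "TriAgent": synthetic_flags(657),
}
intervals = paired_bootstrap_ci(outcomes)
for name, (low, high) in intervals.items():
    print(f"{name}: 95% CI [{100 * low:.1f}, {100 * high:.1f}]")
```

Resampling the same case indices for every system keeps the comparison paired, so the same draws could also be used to bootstrap the accuracy difference between two systems directly.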

Table 3. Per-case latency by routing pathway. The orchestrator issues a median of 7 tool calls per case and triggers a critique remediation round-trip in 38.2% of instrumented continuation cases (Section 4.10).

Routing Pathway	N	Median (s)	p90 (s)
Emergency bypass	216	64	86
Continuation	784	186	268

Table 4. Retrieval quality: single-pass versus iterative retrieval.

Metric	Single-Pass	Iterative
Mean relevance score (0–1)	0.61	0.79
Cases above threshold (%)	71.2	96.7
Cases triggering rewrite	–	312
Resolved after rewrite (%)	–	89.1

Table 5. Validation of the D3 drug-drug interaction classifier against DDInter [46].

Severity Tier	Sensitivity (%)	Specificity (%)	F1
Major interactions	96.2	94.8	0.95
Moderate interactions	88.4	91.3	0.89
Minor interactions	79.6	89.7	0.84
Macro average	88.1	91.9	0.89
False positive rate (%)	5.2
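The macro-average row of Table 5 can be checked as the unweighted mean of the three severity tiers. The sketch below works from the rounded values printed in the table, so it is only a consistency check. It also notes that the reported 5.2% false positive rate coincides with 100 minus the major-tier specificity; whether the authors derived that row this way is an assumption, not something the table states.

```python
from statistics import mean

# (sensitivity %, specificity %, F1) per severity tier, as printed in Table 5.
table5 = {
    "Major": (96.2, 94.8, 0.95),
    "Moderate": (88.4, 91.3, 0.89),
    "Minor": (79.6, 89.7, 0.84),
}

# Transpose rows into per-metric columns, then take unweighted means.
sensitivity, specificity, f1 = zip(*table5.values())
macro = (
    round(mean(sensitivity), 1),
    round(mean(specificity), 1),
    round(mean(f1), 2),
)
print("Macro average (sens %, spec %, F1):", macro)

# ASSUMPTION: false positive rate read as 100 - specificity for major tier.
fpr_major = round(100 - table5["Major"][1], 1)
print("100 - major-tier specificity:", fpr_major)
```

A macro average weights each tier equally regardless of how many drug pairs it contains, which is why the lower minor-tier sensitivity pulls the average down as much as the major tier pulls it up.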

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ibrahim, A.; AlSanousi, A.; Serag, A. TriAgent: An Adaptive Multi-Agent Architecture for Crisis Clinical Decision Support Under Incomplete Information. AI 2026, 7, 230. https://doi.org/10.3390/ai7060230

