Article

TriagE-NLU: A Natural Language Understanding System for Clinical Triage and Intervention in Multilingual Emergency Dialogues

by Béatrix-May Balaban 1, Ioan Sacală 1,* and Alina-Claudia Petrescu-Niţă 2

1 Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica of Bucharest, Splaiul Independentei, No. 313, 060042 Bucharest, Romania
2 Faculty of Applied Sciences, National University of Science and Technology Politehnica of Bucharest, Splaiul Independentei, No. 313, 060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(7), 314; https://doi.org/10.3390/fi17070314
Submission received: 25 June 2025 / Revised: 10 July 2025 / Accepted: 17 July 2025 / Published: 18 July 2025
(This article belongs to the Section Big Data and Augmented Intelligence)

Abstract

Telemedicine in emergency contexts presents unique challenges, particularly in multilingual and low-resource settings where accurate, clinical understanding and triage decision support are critical. This paper introduces TriagE-NLU, a novel multilingual natural language understanding system designed to perform both semantic parsing and clinical intervention classification from emergency dialogues. The system is built on a federated learning architecture to ensure data privacy and adaptability across regions and is trained using TriageX, a synthetic, clinically grounded dataset covering five languages (English, Spanish, Romanian, Arabic, and Mandarin). TriagE-NLU integrates fine-tuned multilingual transformers with a hybrid rules-and-policy decision engine, enabling it to parse structured medical information (symptoms, risk factors, temporal markers) and recommend appropriate interventions based on recognized patterns. Evaluation against strong multilingual baselines, including mT5, mBART, and XLM-RoBERTa, demonstrates superior performance by TriagE-NLU, achieving F1 scores of 0.91 for semantic parsing and 0.89 for intervention classification, along with 0.92 accuracy and a BLEU score of 0.87. These results validate the system’s robustness in multilingual emergency telehealth and its ability to generalize across diverse input scenarios. This paper establishes a new direction for privacy-preserving, AI-assisted triage systems.

1. Introduction

The need for multilingual and clinically grounded natural language understanding is becoming increasingly urgent as telemedicine becomes a central pillar of modern emergency care. In delicate medical situations, such as stroke, cardiac arrest, or trauma, where every minute matters, frontline responders and remote clinicians must coordinate quickly and effectively. However, such coordination is often delayed by language barriers, unstructured or noisy dialogue, and the absence of systems that can reliably interpret medical intent, urgency, and intervention needs from conversations.
While natural language processing (NLP) [1,2] has achieved strong performance in clinical tasks [3] such as named-entity recognition [4], summarization [5], and question answering [6], these systems are typically trained on retrospective clinical notes or structured datasets. They are not designed for multilingual dialogue in high-stakes contexts. Prior work in clinical NLP has focused primarily on medical dialogue using prompt structures [7], while other studies have explored the challenges and opportunities of federated models in the healthcare domain [8]. Furthermore, most existing models lack grounding in clinical intervention protocols or adaptability to regional practices, making real-time deployment unreliable.
Clinical natural language processing has traditionally focused on structured document analysis, including tasks like named-entity recognition (NER), de-identification, and concept normalization from clinical notes. More recent methods leverage deep learning to improve performance on tasks like relation extraction [9] and concept linking [10]. However, these systems are designed for retrospective, monolingual, and often static documents, lacking the capacity for fast interpretation or action-oriented dialogue understanding in emergencies.
Task-oriented dialogue systems for healthcare have seen increasing interest, with systems supporting medication adherence [11], mental health screening [12], and patient triage via chatbots [13]. Some studies model medical dialogue as a sequence-to-sequence problem, while others build slot-filling or question-answering systems grounded in medical ontologies [14]. Yet, few have addressed the specific challenges of instantaneous, multi-role, and high-stakes emergency teleconsultation scenarios.
Clinical decision support systems (CDSSs) traditionally rely on rule-based engines or structured clinical inputs [15]. More recent efforts incorporate neural models for suggesting treatments or interventions [16], but these often depend on EHR (Electronic Health Record) inputs. Few systems interpret natural language as a direct signal for triggering intervention, especially in live emergency contexts. TriagE-NLU differs by mapping parsed utterances to time-critical actions such as activating stroke or trauma protocols, grounded in both dialogue and patient metadata.
Federated learning (FL) has emerged as a privacy-preserving paradigm for training models across decentralized institutions without data sharing. In health contexts, FL has been applied to medical imaging [17] and predictive modeling [18]. However, NLP for emergency settings presents additional challenges in terms of synchronization, adaptation, and the trustworthiness of models across various telemedicine deployments.
Cross-lingual transfer and multilingual models have been explored for general NLP tasks. Nevertheless, emergency care introduces linguistic variation, code-switching, and cultural differences in symptom reporting. There is a pressing need for models that can generalize under such conditions and perform semantic understanding even when resources or parallel data are limited.
Recent advancements in multilingual clinical NLP include systems such as mT5 [19], mBART [20], and XLM-RoBERTa [21], which have enabled cross-lingual modeling of healthcare text. However, these systems remain limited in their ability to jointly model both semantic understanding and intervention planning, especially in low-resource or multilingual clinical settings. For example, GatorTron [22] offers high performance in English-language EHRs but is not designed for dialogue-based input. Other tools, such as MedDG [23], focus on generative dialogue but lack structured intervention classification. To date, no comprehensive system integrates multilingual dialogue parsing, patient context, and clinical action mapping within a privacy-aware framework.
We hypothesize that a multilingual NLP system trained on structured emergency input data can accurately map clinical utterances to intervention decisions across languages and resource settings.
To address these challenges, this paper introduces TriagE-NLU, a semantic understanding system tailored for emergency telemedicine triage and intervention support. The name TriagE-NLU comes from combining two key concepts central to the paper: TriagE refers to medical triage, the clinical process of prioritizing patients based on the urgency of their conditions. This reflects the system’s purpose: to understand medical dialogues and support urgent assessment and intervention decisions in emergency telemedicine. On the other hand, NLU stands for natural language understanding, a subfield of natural language processing (NLP) that focuses on machine understanding of human language.
Put together, TriagE-NLU designates a natural language understanding system for clinical triage capable of interpreting multilingual medical dialogues, assessing urgency, and recommending interventions based on a structured semantic framework.
TriagE-NLU enables structured understanding of multilingual conversations between patients and remote clinicians. It fuses natural language with patient metadata (e.g., age, known conditions, symptom onset), contextual cues (e.g., temporal markers, urgency expressions), and ontology-grounded semantics to recommend appropriate clinical interventions. Built on a federated learning framework, the system continuously improves through privacy-preserving updates across institutions.
This paper makes the following key contributions:
  • We develop TriagE-NLU, a multilingual, patient-aware natural language understanding system that combines semantic parsing and intervention classification for emergency telemedicine.
  • We introduce EUREKA (Emergency Understanding Representation for Evidence-based Knowledge and Action). This unified semantic schema encodes symptoms, onset patterns, risk factors, and inferred urgency to map patient dialogue to structured intent.
  • We construct TriageX, the first synthetic, clinically annotated dataset for multilingual emergency dialogue understanding, and train our model using federated averaging (FedAvg) to support privacy-preserving, cross-institutional deployment.
Together, these components establish a new foundation for NLP in emergency medicine, shifting from retrospective analysis to dynamic, immediate clinical reasoning that supports equitable and timely care.

2. Related Work

Multilingual natural language understanding (NLU) in healthcare has seen significant advancements through the adoption of transformer-based models that enable cross-lingual generalization and multitask learning. However, most of this progress has been centered around retrospective clinical narratives, such as discharge summaries or EHR notes, rather than dialogue or interactive decision support. Furthermore, few of the existing systems can perform joint semantic parsing and intervention recommendation, especially under multilingual and low-resource constraints.
The mT5-Base model [19] has been widely adopted for multilingual sequence-to-sequence tasks. As a fully text-to-text architecture pre-trained on the mC4 corpus across more than 100 languages, mT5 is suitable for semantic parsing tasks such as converting utterances into structured clinical frames.
The mBART model, designed as a multilingual denoising autoencoder, supports language generation across over 25 languages. When combined with a multilayer perceptron (MLP) classifier [24], mBART can encode multilingual dialogue and output intervention labels. This hybrid setup is effective for classification tasks in resource-constrained languages and was used in our benchmarks for intervention prediction. However, the system treats each utterance independently, with no explicit mapping to underlying symptom structures or temporal markers. It lacks interpretability and does not utilize a structured representation that reflects patient risk or clinical decision logic.
XLM-RoBERTa [21], paired with a BiLSTM classifier, has been employed for sequence labeling tasks such as slot-filling and medical named-entity recognition (NER) [25]. Its strong multilingual representation capabilities make it suitable for token-level symptom detection. The model does not include contextual metadata like patient demographics or time references and is limited to basic, surface-level extraction.
In contrast to these approaches, TriagE-NLU introduces a unified architecture that integrates semantic parsing with clinical intervention classification, grounded in a structured, ontology-aligned representation of emergency dialogue. It processes not only text transcriptions but also multimodal inputs such as patient metadata and temporal context. The model incorporates a hybrid rules-and-policy decision engine informed by clinical guidelines and supports privacy-preserving training through federated learning. These capabilities allow it to generalize across languages, adapt to local protocols, and provide interpretable and medically actionable outputs in telemedicine deployments.

3. Proposed TriagE-NLU System

TriagE-NLU represents a multimodal semantic understanding system designed for processing telemedicine dialogue during emergency scenarios. It integrates textual dialogue, patient metadata, and contextual signals into a unified pipeline that produces structured semantic representations and clinical action recommendations. The system is designed to be modular, privacy-preserving, and generalizable across languages and regional healthcare protocols. Figure 1 illustrates the proposed model.
The TriagE-NLU system ingests multimodal input (speech, text, metadata) and processes it using Python 3.11.9 and SentencePiece 0.2.0 for language tagging and tokenization. Tokenized input is passed to an mT5-Base transformer (via Hugging Face Transformers), which performs semantic parsing using the EUREKA frame schema and classifies intervention labels. The model is trained on the synthetic, multilingual TriageX dataset and updated via a federated learning loop with secure aggregation. Output is validated using constrained decoding and a slot ontology, then deployed on CUDA-enabled infrastructure and evaluated using Slot F1, Frame F1, and Macro F1 metrics. Slot F1 measures the accuracy of individual field extractions (e.g., symptoms, onset time), Frame F1 evaluates exact matches of full semantic frames, Macro F1 averages performance across all classes equally, and the multilingual metric assesses consistency across languages.
The core architecture of TriagE-NLU consists of four main components.
  • Multimodal Input Processing: The system accepts heterogeneous input streams, including transcribed input speech or dialogue between patients and clinicians, structured patient metadata (e.g., age, pre-existing conditions, vital signs), and temporal and contextual markers (e.g., symptom onset duration, location metadata).
  • Semantic Parsing Engine: This module leverages multilingual transformer models fine-tuned for low-resource clinical contexts. It maps utterances and contextual signals to a structured representation based on a new proposed EUREKA schema (Emergency Understanding Representation for Evidence-based Knowledge and Action), which will be described in Section 4, that encodes symptom mentions, temporal qualifiers, patient risk factors, and urgency indicators.
  • Intervention Grounding Module: Parsed semantic frames are fed into a decision engine that aligns recognized medical states with appropriate interventions using a rules-and-learned-policy hybrid model informed by clinical guidelines (e.g., ACLS, WHO triage protocols). For example, a combination of “slurred speech”, “unilateral weakness”, and “<10 min onset” is mapped to “stroke protocol activation”.
  • Federated Learning Update Loop: Each deployment instance of TriagE-NLU (e.g., in a regional hospital network) trains on local usage patterns while sharing model updates through secure aggregation. This ensures continuous adaptation without exposing sensitive clinical dialogue, maintaining both data locality and generalization capabilities.
In high-level terms, TriagE-NLU translates unstructured multilingual interaction into structured semantic meaning and uses it to drive urgent decision support. It is robust to input variability and informal symptom descriptions, making it suitable for deployment in diverse telemedicine environments.
Presenting a use case scenario, Figure 2 illustrates the TriagE-NLU architecture in action, showing how patient speech, metadata, and symptom onset are processed through a multimodal input module and semantic parsing engine. This structured information triggers a clinically informed intervention (e.g., stroke protocol activation) and feeds into a federated learning loop for secure model improvement across deployments.

Novelty Positioning

This paper introduces a novel system, TriagE-NLU, that advances computational linguistics in the context of fast telemedicine triage. While NLP has seen extensive progress in areas such as clinical note parsing, intent classification, and dialogue systems, existing approaches fall short in several dimensions critical for high-stakes emergency interactions. Table 1 emphasizes the unique positioning of TriagE-NLU within the current landscape of clinical NLP and telemedicine research.
We highlight the primary axes of originality below.
  • Patient-Aware Semantic Understanding: Unlike prior work focused on retrospective clinical texts, our system models spontaneous speech or dialogue between patients and clinicians. It fuses natural language with structured patient data (e.g., age, onset time, and vital signs), enabling patient-specific semantic parsing in time-sensitive emergency settings.
  • Grounding Semantic Representations in Interventions: To our knowledge, TriagE-NLU is the first to bridge high-level semantic parsing with clinically actionable output. While prior systems have modeled symptom classification or medical Q&A, none have mapped parsed utterances to structured intervention suggestions in multilingual telemedicine contexts.
  • Multilingual and Low-Resource Triage Generalization: Most existing systems are monolingual or assume idealized input. We propose a multilingual, low-resource capable system that performs fast semantic parsing across languages, using weak supervision and cross-lingual transfer. This enables deployment in under-resourced regions where emergency response support is critically needed.
  • Federated and Privacy-Preserving Learning: TriagE-NLU integrates federated learning for continual improvement across decentralized clinical environments. This addresses key privacy and legal barriers to training on sensitive conversations while allowing regional adaptation to medical protocols, linguistic variance, and cultural expression of symptoms.
  • Absence of Existing Benchmarks: We introduce TriageX, a benchmark dataset of simulated multilingual emergency teleconsultation inputs. While dialogue datasets exist in general-purpose or open-domain contexts, there is no publicly available resource that combines medical urgency, multiple languages, and structured intervention annotation, establishing a new benchmark task for understanding semantic medical triage.

4. Semantic Representation: EUREKA Schema Proposal

To enable intervention-aware understanding of emergency telemedicine dialogue, we propose the EUREKA schema (Emergency Understanding Representation for Evidence-based Knowledge and Action). EUREKA is a structured, interpretable semantic representation that captures clinical relevance from spontaneous, multilingual, and multi-role conversations. It integrates utterance-level intent, patient-level context, and triage-level urgency into a unified format suitable for downstream reasoning and intervention recommendation.

4.1. Design Goals

The EUREKA schema was designed with four primary goals in mind. First, interpretability was prioritized to ensure that the structured outputs are human-readable, allowing clinicians to verify and act upon them easily. Second, the schema emphasizes expressiveness, enabling it to capture nuanced clinical meanings such as temporal qualifiers (e.g., “since last night”), compound symptoms (e.g., “chest pain with shortness of breath”), and context-specific urgency markers. Third, multilingual compatibility was considered essential to abstract language-specific phrasing into consistent conceptual slots, ensuring robustness across diverse linguistic inputs. Lastly, the schema supports intervention alignment, allowing the parsed structures to be directly mapped to clinical decision trees and actionable protocols.

4.2. Schema Structure

The EUREKA representation is structured across four main tiers that work together to model the semantics of emergency medical input. The first is the Symptom and Condition Layer, which extracts individual symptoms or medical concepts such as “speech.slurred”, “pain.chest”, or “limb.weakness.right”. The second is the Temporal and Onset Layer, responsible for capturing time expressions and the recency of symptoms, including indicators like “onset_time = <10 min>” or “symptom_duration = ‘since yesterday’”. The third tier, the Risk and Metadata Layer, embeds relevant patient context such as age and pre-existing conditions while also incorporating an automatically inferred risk_level. Finally, the Urgency and Action Layer determines triage urgency and links the semantic frame to appropriate clinical interventions, such as “action = activate_stroke_protocol”.
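To make the four-tier structure concrete, the following minimal Python sketch models a EUREKA frame as a dataclass. The class name EurekaFrame and the exact field types are illustrative assumptions on our part; only the slot names mirror the layers described above.

from dataclasses import dataclass, field


@dataclass
class EurekaFrame:
    # Illustrative four-tier EUREKA frame; slot names follow the schema layers.
    # Tier 1: Symptom and Condition Layer
    symptoms: set[str] = field(default_factory=set)    # e.g., {"speech.slurred", "limb.weakness.right"}
    # Tier 2: Temporal and Onset Layer
    onset_time: str | None = None                      # e.g., "<10 min>"
    symptom_duration: str | None = None                # e.g., "since yesterday"
    # Tier 3: Risk and Metadata Layer
    age: int | None = None
    comorbidities: set[str] = field(default_factory=set)
    risk_level: str | None = None                      # automatically inferred, e.g., "high"
    # Tier 4: Urgency and Action Layer
    urgency: str | None = None                         # "medium" | "urgent" | "critical"
    action: str | None = None                          # e.g., "activate_stroke_protocol"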

4.3. Parsing Workflow

Statements from multilingual speech/text are processed by a transformer-based semantic parser that identifies relevant tokens, aligns them to schema slots using constrained decoding, and integrates structured patient data to fill or refine entries. Temporal reasoning is supported through lightweight rule-based modules and clinical heuristics. Figure 3 illustrates the workflow of the EUREKA scheme.
Here is an example of the parsing. Given the input: “My mother can’t speak properly and her right arm is numb. It just started.”, EUREKA parses this into the following:
symptoms = {speech.slurred, limb.weakness.right},
onset_time = <5 min>,
risk = high,
action = activate_stroke_protocol
EUREKA enables fine-grained semantic abstraction while preserving alignment with clinical realities. It ensures that NLP outputs from TriagE-NLU are both machine-actionable and human-verifiable, essential for deployment in safety-critical settings like telemedicine triage.

5. Training Methodology

The TriagE-NLU framework is trained through a combination of supervised, weakly supervised, and federated learning strategies. This hybrid training paradigm enables high performance in multilingual, low-resource clinical settings while ensuring compliance with privacy and data governance constraints in healthcare deployments. Figure 4 presents an overview of the proposed training methodology.

5.1. Data Preparation

The data used in this study were generated synthetically to simulate multilingual emergency teleconsultation dialogues. No real patient data or teleconsultation transcripts were collected, and therefore, no ethical approval was required. The simulated dialogues were constructed using Python 3.11.9 scripts that systematically combine predefined medical templates, symptom lists, temporal expressions, and demographic metadata. These templates were inspired by clinical scenarios commonly encountered in emergency medicine (e.g., stroke, asthma, myocardial infarction) and encoded in five target languages: English, Spanish, Romanian, Arabic, and Mandarin.
Each synthetic sample consists of a text input between a virtual clinician and a patient or caregiver. The dialogue includes naturalistic utterances aligned with structured medical information, such as symptoms, onset times, comorbidities, and risk factors. The corresponding ground truth annotations were programmatically generated alongside each sample using the proposed EUREKA schema, allowing automatic production of semantically structured representations. Intervention labels were assigned similarly based on deterministic mapping rules that link recognized symptom patterns with established triage protocols (e.g., activating the stroke protocol when slurred speech and unilateral arm weakness occur within 30 min of onset).
To enhance linguistic variability, multilingual text augmentation strategies were applied to introduce variation in phrasing, word order, and use of regional expressions. While no real teleconsultation recordings were used, the synthetic corpus was designed to preserve clinical plausibility and cover a diverse range of urgent care scenarios. This approach enables large-scale, multilingual training without compromising privacy or data protection standards.
The criteria for weakly labeled multilingual data generation and cross-lingual projection in our approach begin with the construction of annotated samples in English using predefined symptom templates aligned with the EUREKA schema. These annotations, including symptom spans and urgency indicators, are then projected to the target languages by preserving their semantic roles through deterministic inheritance. To ensure accuracy, token-level alignment techniques, such as attention maps from multilingual transformer models or external alignment tools, were applied to map the English labels to corresponding tokens in the translated text. Additionally, language-specific rules were implemented to correct syntactic variations, such as word order and prepositional structures, thereby maintaining grammatical and semantic consistency. Although the data remains fully synthetic, a portion of each generated dataset underwent manual spot-checking to validate linguistic plausibility and annotation quality. This process ensures the reproducibility and structural consistency of multilingual annotations across languages.
The resulting dataset included fully synthetic dialogues built from clinical templates and encoded triage rules, as well as weakly labeled multilingual samples projected from English using rule-based extractors and cross-lingual mapping. Patient-specific metadata, such as age, comorbidities, and vital signs, was programmatically injected as auxiliary features. All inputs were normalized and tokenized using subword tokenization compatible with multilingual transformer models, ensuring consistency across languages and modalities.
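As an illustration of this generation process, the following simplified Python sketch fills a language-specific template with sampled symptoms, onset expressions, and demographic metadata, and emits the paired annotation. The template strings, symptom phrases, and helper name generate_sample are hypothetical placeholders, not the actual TriageX generation code.

import random

# Hypothetical, heavily simplified template bank; the real generation scripts use
# richer, language-specific templates for all five languages (English shown here).
TEMPLATES = {"en": "{relation} {symptom_phrase}. It started {onset} ago."}
SYMPTOM_PHRASES = {
    "speech.slurred": "can't speak properly",
    "limb.weakness.right": "can't move the right arm",
    "pain.chest": "has strong chest pain",
}
ONSETS = ["5 min", "15 min", "30 min", "2 h"]
RELATIONS = ["My mother", "My husband", "My neighbour"]


def generate_sample(lang: str = "en") -> dict:
    # Sample one or two symptoms and an onset expression, then fill the template.
    symptoms = random.sample(list(SYMPTOM_PHRASES), k=random.choice([1, 2]))
    onset = random.choice(ONSETS)
    text = TEMPLATES[lang].format(
        relation=random.choice(RELATIONS),
        symptom_phrase=" and ".join(SYMPTOM_PHRASES[s] for s in symptoms),
        onset=onset,
    )
    # The ground-truth annotation is produced alongside the text, as described above.
    age = random.randint(30, 90)
    annotation = {
        "symptoms": symptoms,
        "onset_time": onset,
        "age": age,
        "risk_level": "high" if age >= 65 else "medium",
    }
    return {"lang": lang, "text": text, "annotation": annotation}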

5.2. Semantic Parsing Model

We fine-tune a multilingual T5 model (mT5-Base) using a sequence-to-sequence formulation. Input tokens consist of serialized dialogue and patient features, while outputs are linearized EUREKA schema instances. We apply constrained decoding to enforce type and slot consistency.
TriagE-NLU utilizes the multilingual T5 model (mT5-Base) fine-tuned to perform semantic parsing of emergency teleconsultation dialogues into structured EUREKA frames. The input to the model is a dialogue transcript, and the output is a serialized EUREKA slot structure representing the extracted clinical semantics. The fine-tuning was conducted on the synthetic and weakly labeled multilingual data described in Section 5.1 using a text-to-text paradigm.
To improve parsing consistency and ensure adherence to the structured EUREKA schema, we implemented a slot-constrained decoding mechanism during inference. This mechanism restricts the model’s output space to valid slot sequences and slot values based on pre-defined constraints. These constraints were defined through a set of deterministic rules that enforce schema compliance and semantic correctness.
The development of decoding constraints followed a rule-based specification informed by the design of the EUREKA annotation schema. For instance, slots such as symptom, onset_time, urgency, and intervention were each associated with specific constraints. These included a closed or semi-closed list of acceptable values (such as “chest pain”, “slurred speech”, or “30 min”), type-specific formatting rules (for example, expressing onset time in minutes or hours, or using Boolean indicators for presence or absence), and logical dependency rules that enforce clinical consistency, such as requiring a high urgency level to be inferred when both slurred speech and unilateral weakness are present.
The decoding rules were validated through simulation and backtesting on a subset of the synthetic dataset where the expected output structure was known. During decoding, if the model attempted to produce invalid or incomplete slot-value pairs (e.g., an unrecognized symptom or malformed time expression), the decoder would dynamically adjust the output beam to select a valid alternative, prioritizing candidates that matched allowable slot configurations.
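A minimal sketch of the validation logic behind this constrained decoding is shown below. It checks candidate slot structures against a closed symptom list, an onset-time format rule, and the logical dependency rule mentioned above, then prefers the beam candidate with the fewest violations. The constraint tables and helper names are illustrative; the deployed decoder applies such checks while dynamically adjusting the output beam.

import re

# Illustrative constraint tables; the actual system defines these per EUREKA slot.
ALLOWED_SYMPTOMS = {"slurred speech", "right arm weakness", "chest pain", "dizziness"}
ONSET_PATTERN = re.compile(r"^\d+\s*(min|h)$")        # e.g., "15 min", "2 h"
URGENCY_LEVELS = {"medium", "urgent", "critical"}


def validate_frame(frame: dict) -> list[str]:
    # Return the list of constraint violations for one candidate slot structure.
    errors = []
    for s in frame.get("symptoms", []):
        if s not in ALLOWED_SYMPTOMS:
            errors.append(f"unknown symptom: {s!r}")
    onset = frame.get("onset_time")
    if onset is not None and not ONSET_PATTERN.match(onset):
        errors.append(f"malformed onset_time: {onset!r}")
    if frame.get("urgency") not in URGENCY_LEVELS:
        errors.append("missing or invalid urgency")
    # Logical dependency rule: slurred speech + unilateral weakness implies critical urgency.
    if {"slurred speech", "right arm weakness"} <= set(frame.get("symptoms", [])):
        if frame.get("urgency") != "critical":
            errors.append("urgency must be 'critical' for this symptom combination")
    return errors


def select_valid(candidates: list[dict]) -> dict:
    # Prefer the beam candidate that satisfies all constraints (fallback: fewest violations).
    return min(candidates, key=lambda f: len(validate_frame(f)))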
An example English input is as follows:
Patient: “She suddenly couldn’t speak and couldn’t move her right arm. It’s been about 15 min.”
The model, aided by constrained decoding, generated the following valid frame: symptoms “slurred speech”, “right arm weakness”, onset time “15 min”, urgency “critical” and intervention “activate stroke protocol”.
Without slot constraints, the model sometimes produced inconsistent frames, such as “symptoms”: “arm problem”, or omitted essential fields, which compromised usability; the decoding constraints directly reduce such errors. In summary, slot-constrained decoding ensures structural fidelity, prevents schema violations, and allows reliable use of mT5 for multilingual clinical parsing across the diverse linguistic inputs targeted by TriagE-NLU.

5.3. Intervention Classification Model

Following the EUREKA-based semantic parsing described in Section 5.2, we implement a lightweight intervention classification module that operates as a secondary head over the multilingual T5 encoder. This component predicts an appropriate medical response (e.g., “dispatch ambulance”, “activate stroke protocol”) based on the encoder’s representation of the dialogue and structured patient context.
The classifier is implemented as a two-layer feedforward neural network (multi-layer perceptron) attached to the encoder output of the mT5 model. Specifically, the hidden state corresponding to the first token in the encoder’s final hidden layer is extracted (serving as a pooled representation akin to a [CLS] token) and passed through a linear layer with 128 hidden units, followed by an ReLU activation and a final linear layer projecting to the number of intervention classes. This pooled vector captures the global context of the input dialogue and is used to infer the appropriate intervention decision.
The model is trained using a supervised classification objective with cross-entropy loss and optimized using the Adam optimizer with a learning rate of 1 × 10−4. A dropout probability of 0.1 is applied during training to mitigate overfitting, and inverse-frequency class weights are used to address label imbalance among intervention categories. While L2 regularization and early stopping are not employed in the current setup, the model exhibits stable convergence within 20 epochs when trained on synthetic and weakly labeled data using a batch size of 64.
At inference time, the classifier operates in parallel with the mT5 text generation pathway. After the encoder processes the dialogue input, the classifier predicts the intervention label independently of the generated EUREKA frame. This design preserves modularity and supports the use of constrained decoding to ensure schema compliance for the structured output. This two-stage design supports multilingual instant triage by enabling the model to provide both a structured semantic frame and an immediate intervention recommendation. Moreover, decoupling parsing from classification enhances interpretability and enables independent refinement or replacement of either module.
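The classification head described above can be sketched as follows, assuming a PyTorch/Hugging Face implementation. The hyperparameters (128 hidden units, ReLU, dropout 0.1, cross-entropy with inverse-frequency class weights, Adam at 1 × 10−4) follow the description in this subsection, while the class name, the number of intervention classes, and the placeholder class weights are our assumptions.

import torch
from torch import nn
from transformers import MT5EncoderModel


class InterventionClassifier(nn.Module):
    # Two-layer MLP head over the pooled mT5 encoder output.

    def __init__(self, num_classes: int, model_name: str = "google/mt5-base"):
        super().__init__()
        self.encoder = MT5EncoderModel.from_pretrained(model_name)
        hidden = self.encoder.config.d_model
        self.head = nn.Sequential(
            nn.Linear(hidden, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = states[:, 0, :]          # first-token state used as a [CLS]-like summary
        return self.head(pooled)


# Training setup with the reported hyperparameters; the class weights shown here are
# placeholders for the inverse-frequency weights computed from the training labels.
num_classes = 6
class_weights = torch.ones(num_classes)
model = InterventionClassifier(num_classes)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)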

5.4. Federated Learning Protocol

To support distributed clinical deployments, TriagE-NLU incorporates a federated learning protocol. Each client (e.g., regional clinic or telemedicine center) trains the model locally on anonymized usage data. Gradients are aggregated securely via federated averaging (FedAvg), with differential privacy mechanisms applied to preserve patient confidentiality. Clients may share partial semantic representations (not raw text) as part of an adaptive knowledge distillation framework.
In our proposal, we applied the FedAvg algorithm to enable decentralized training of the TriagE-NLU model across multiple clinical sites. Each participating node (e.g., hospital, telemedicine center) trains its local instance of the model using in-house anonymized data. Instead of transmitting raw patient dialogue or sensitive metadata, only model updates (e.g., gradients or weights) are periodically shared with a central aggregator. The server then performs FedAvg by computing the weighted average of these updates and distributes the global model back to the clients. To strengthen privacy, we incorporated differential privacy techniques and allowed optional exchange of compressed semantic representations, such as EUREKA frames, instead of raw textual data, aligning with an adaptive knowledge distillation approach. This setup ensures data locality, enhances generalizability across institutions, and preserves clinical confidentiality in multilingual, real-time environments.
To maintain data confidentiality in cross-institutional deployments, TriagE-NLU incorporates a layered privacy-preserving design within its federated learning (FL) protocol. Each participating institution, such as a hospital or telemedicine center, trains a local instance of the TriagE-NLU model using only synthetic data stored onsite. At no point are raw patient dialogues or identifiable metadata transmitted externally.
Prior to sharing model updates, such as gradient deltas or weight changes, each client applies differential privacy techniques by injecting calibrated noise, which protects against model inversion and membership inference attacks that could otherwise allow input reconstruction. To further ensure the confidentiality of shared information, TriagE-NLU adopts secure aggregation protocols, allowing the central server to compute a global model from encrypted updates without accessing any institution’s raw contributions. All communications between client nodes and the central aggregator are encrypted using standard protocols, such as Transport Layer Security (TLS), to guarantee the integrity and confidentiality of transmitted data.
Model updates and semantic representations, when shared, are encrypted both in transit and at rest, using public-key cryptography to ensure that only authorized parties can decrypt aggregate results. Participation in the FL network is controlled through federated identity management and access controls, ensuring that only authenticated endpoints contribute to training and receive model updates. The system is designed to support integration with institutional role-based access mechanisms and to maintain audit logs to ensure traceability and compliance with data governance policies. Instead of exchanging textual inputs, TriagE-NLU supports optional transmission of structured and anonymized semantic frames, such as EUREKA representations, which limit semantic leakage and reinforce data minimization. Although no real clinical data is used in this study, the FL design of TriagE-NLU aligns with data protection frameworks such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). This comprehensive approach enables the system to operate ethically and securely across diverse, distributed clinical environments.
In our experimental simulation, we emulate federated learning across 5–10 clients per round, with local training conducted for one epoch before synchronizing with the central aggregator. To ensure robustness in asynchronous environments, TriagE-NLU includes safeguards for handling stragglers and faulty updates. Specifically, clients that fail to return updates within a specified timeout window are excluded from the current round without halting training. Additionally, incoming updates are verified against integrity checks, and abnormally deviating gradients are downweighted to mitigate the impact of malicious or corrupted contributions. This ensures that the global model evolves reliably even in partially connected or error-prone deployments.
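A compact sketch of the aggregation step, under the assumption that each client returns a PyTorch state dict together with its local sample count (or None when it was excluded for timing out or failing the integrity check), is given below. Differential privacy noise, secure aggregation, and gradient downweighting are omitted for brevity.

import torch


def fedavg_round(
    global_state: dict[str, torch.Tensor],
    client_updates: list[tuple[dict[str, torch.Tensor], int] | None],
) -> dict[str, torch.Tensor]:
    # Keep only clients that returned an update this round (stragglers are skipped).
    valid = [u for u in client_updates if u is not None]
    total = sum(n for _, n in valid)
    new_state = {}
    for key in global_state:
        # Weighted average of each parameter, weighted by local sample counts.
        new_state[key] = sum(state[key].float() * (n / total) for state, n in valid)
    return new_state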

5.5. Cross-Lingual Alignment and Adaptation

TriagE-NLU is designed to operate across linguistically diverse environments, enabling semantic parsing and triage prediction in multiple languages. To facilitate cross-lingual generalization, we leverage the mT5-Base model, a multilingual sequence-to-sequence transformer pretrained on a large corpus of over 100 languages. This language-agnostic pretraining provides robust encoder representations that support transfer learning in both high-resource and low-resource language settings.
A key component of our cross-lingual strategy is the use of a shared EUREKA slot ontology, which defines a standard set of clinical entities and attributes (e.g., symptoms, onset time, urgency, comorbidities, interventions) in a language-independent manner. All training samples, regardless of language, are mapped to this unified schema. This shared structure ensures that the model learns a consistent semantic frame across different linguistic inputs, reducing language-specific variability in output formatting and facilitating transfer.
During training, we first construct annotated examples in English and then project these annotations to other target languages through translation and alignment. To maintain semantic integrity during projection, we employ token-level alignment using attention maps and heuristic rules that account for language-specific syntax. These aligned labels are used to train the multilingual T5 model to generate EUREKA-compliant frames across all supported languages.
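The projection step can be illustrated with the simplified heuristic below, which locates a translated slot value in the target-language sentence and records its span while keeping the language-independent slot label. The translation table and helper name are hypothetical; the actual pipeline additionally uses attention-based token alignment and language-specific syntactic rules.

# Hypothetical value translations (Romanian shown); real tables cover all target languages.
VALUE_TRANSLATIONS = {
    "ro": {
        "slurred speech": "vorbire neclară",
        "right arm weakness": "slăbiciune la brațul drept",
    },
}


def project_annotation(en_frame: dict, target_text: str, lang: str) -> dict:
    # Project English slot values onto the translated sentence, preserving semantic roles.
    projected = {"symptoms": [], "spans": []}
    lowered = target_text.lower()
    for value in en_frame.get("symptoms", []):
        translated = VALUE_TRANSLATIONS[lang].get(value)
        if translated and translated in lowered:
            start = lowered.index(translated)
            projected["symptoms"].append(value)            # slot label stays language-independent
            projected["spans"].append((start, start + len(translated)))
    return projected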
The EUREKA schema itself serves as a cross-lingual anchor, enabling the model to generalize from syntactic surface forms to more profound clinical semantics. This design is particularly effective in low-resource languages, where labeled data is limited, but the structured nature of the output provides a stable target. For instance, in Romanian and Arabic, languages with fewer pretraining resources, the model achieves only a minor degradation in frame accuracy.
We also observed that constrained decoding plays a stabilizing role in low-resource settings by enforcing output structure. Even when token-level errors occur due to noise in the input, the decoder frequently recovers valid EUREKA frames thanks to schema-aware slot constraints.
Most of the errors observed were due to mild misalignments in symptom phrasing or ambiguity in translated onset expressions, for example, in Romanian “aproximativ acum treizeci de minute” (“around thirty minutes ago”) being misinterpreted as a broader time range. Despite these minor discrepancies, the intervention classification accuracy remained stable across languages, with a deviation of less than 5% from the English baseline. Constrained decoding was particularly effective in these settings, often recovering valid EUREKA frames even when token-level noise or partial input loss occurred. For example, in the Arabic dialogue:
المريض يشعر بدوخة ولا يستطيع تحريك يده اليمنى منذ حوالي عشر دقائق.
(“The patient feels dizzy and has been unable to move his right hand for about ten minutes.”)
TriagE-NLU correctly produced symptoms (dizziness, right arm weakness), onset_time (10 min), urgency (critical) and intervention (activate stroke protocol).
Overall, TriagE-NLU achieves robust multilingual performance by combining language-agnostic encoder representations, shared schema supervision, and alignment-aware training techniques. This approach enables consistent clinical interpretation across diverse linguistic contexts, a critical requirement for scalable emergency triage systems.

5.6. Evaluation Strategy

We perform model selection using development sets stratified by language, patient profile, and triage scenario type. To ensure robustness, we evaluate generalization across unseen combinations of symptoms, languages, and metadata profiles. Detailed metrics and results are reported in Section 8.
This training framework balances the need for rich semantic understanding, deployment practicality, and data privacy, ensuring that TriagE-NLU remains both clinically useful and ethically reliable.
The TriagE-NLU prototype is implemented in Python 3.11.9. Details regarding the implementation steps for TriagE-NLU are provided in Appendix A.

6. Benchmark Construction: TriageX Dataset

To evaluate the performance of TriagE-NLU and promote further research in emergency telemedicine NLP, we introduce TriageX, the first multilingual, intervention-annotated dataset designed specifically for clinical dialogue understanding. It comprises a total of 5000 entries across five languages: English, Spanish, Romanian, Arabic, and Mandarin.
To implement TriageX, a configurable Python 3.11.9 script was developed that programmatically generates multilingual, clinically grounded emergency dialogues using language-specific symptom templates, patient metadata, and intervention mappings. The script ensures natural linguistic variation and correct grammatical structure across the five languages, enabling scalable simulation of realistic telemedicine triage scenarios.

6.1. Motivation and Design Principles

Existing datasets in medical NLP primarily focus on retrospective clinical documentation or general-purpose medical question answering. TriageX is constructed to reflect the multilingual and high-stakes nature of emergency teleconsultation. It includes structured interventions and urgency annotations aligned with clinical protocols, making it suitable for both semantic parsing and downstream triage action prediction.

6.2. Data Sources and Simulation Framework

The TriageX dataset is fully synthetic and was constructed using parameterized Python 3.11.9 scripts that simulate multilingual emergency scenarios. Symptom templates were defined based on well-documented acute medical conditions such as stroke, myocardial infarction, asthma attacks, and diabetic crises. Each template includes a structured mapping between symptoms (e.g., slurred speech, chest pain, shortness of breath) and their corresponding EUREKA-coded semantic frames.
Intervention rules were crafted based on generalized international clinical guidelines, such as those from the World Health Organization (WHO), the American Heart Association (AHA), and established triage protocols. For example, the co-occurrence of “unilateral arm weakness” and “slurred speech” within a short temporal window (<30 min) was mapped to an intervention recommendation of “activate stroke protocol”. While no licensed medical professionals were involved in the dataset creation, the symptom-intervention logic was implemented deterministically, reflecting broadly accepted emergency response patterns.
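The deterministic symptom-to-intervention logic can be sketched as a small rule function, shown below. The stroke rule follows the example given above; the chest-pain rule and the fallback label are illustrative additions drawn from the intervention labels used elsewhere in the dataset, not a complete reproduction of the TriageX rule set.

def parse_minutes(onset: str) -> int | None:
    # Tiny helper: "15 min" -> 15, "2 h" -> 120 (illustrative formats only).
    parts = onset.split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    value, unit = int(parts[0]), parts[1]
    return value * 60 if unit.startswith("h") else value


def map_intervention(symptoms: set[str], onset: str) -> str:
    minutes = parse_minutes(onset)
    # Stroke rule from the text: unilateral arm weakness + slurred speech within 30 min.
    if {"slurred speech", "unilateral arm weakness"} <= symptoms and minutes is not None and minutes < 30:
        return "activate stroke protocol"
    # Assumed additional rule for illustration purposes.
    if {"chest pain", "shortness of breath"} <= symptoms:
        return "dispatch ambulance"
    return "monitor remotely"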
To ensure ethical compliance, no real clinical data or personally identifiable information was used in TriageX. All symptoms and metadata were randomly generated using rule-based and probabilistic sampling methods. While no domain experts directly validated the data, the template design and decision logic were aligned with established emergency medical doctrine, ensuring clinically plausible and reproducible scenarios suitable for research in low-risk, simulation-driven settings.

6.3. Schema and Annotation Protocol

Each dialogue instance in TriageX is annotated with a structured EUREKA-formatted semantic frame, capturing the clinical content of the exchange. In addition, patient metadata and relevant linguistic context, such as code-switching or the use of temporal phrases, are included to support nuanced interpretation. An intervention label is assigned to indicate the appropriate clinical response (e.g., “monitor remotely”, “dispatch ambulance”, or “activate stroke protocol”), along with an urgency level that classifies the severity of the situation as medium, urgent, or critical.
Figure 5 illustrates five entries from TriageX, in English (a), Spanish (b), Arabic (c), Mandarin (d), and Romanian (e).

6.4. Languages and Demographics

TriageX covers five major languages (English, Spanish, Romanian, Arabic, Mandarin) and includes demographic diversity across age, gender, comorbidity profiles, and regional expressions. This ensures robust cross-linguistic evaluation and applicability. Each language has 1000 entries.
To reflect demographic diversity, each dialogue instance includes simulated patient metadata with structured attributes such as age, gender, and comorbidity profiles. Ages are randomly assigned within a clinically relevant range, typically spanning from 30 to 90 years, and are used to modulate risk estimation (e.g., patients over 65 are marked as high-risk). Gender is evenly distributed between male and female patients to avoid bias in intervention prediction. Comorbidities are sampled from a curated set of prevalent chronic conditions, such as hypertension, diabetes, asthma, and stroke history, defined separately for each language to ensure cultural and clinical realism. In addition, relational roles (e.g., “my mother,” “my husband”) are culturally localized per language to simulate authentic caregiver narratives. This demographic and linguistic richness supports rigorous evaluation of model robustness across diverse patient profiles and language contexts, facilitating the development of fair and generalizable emergency triage systems.
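A minimal sketch of this metadata sampling, with an assumed comorbidity pool and the age-based risk flag described above, is given below; the pool contents and helper name are illustrative.

import random

COMORBIDITY_POOLS = {
    "en": ["hypertension", "diabetes", "asthma", "stroke history"],
    # Analogous, culturally localized pools are defined for es, ro, ar, and zh.
}


def sample_patient_metadata(lang: str = "en") -> dict:
    age = random.randint(30, 90)                              # clinically relevant age range
    return {
        "age": age,
        "gender": random.choice(["female", "male"]),          # evenly distributed by construction
        "comorbidities": random.sample(COMORBIDITY_POOLS[lang], k=random.randint(0, 2)),
        "high_risk": age > 65,                                # age-based risk modulation
    }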

6.5. Benchmark Tasks

We define three core benchmark tasks for evaluating TriagE-NLU and comparable systems. The first task, semantic parsing accuracy, measures the model’s ability to extract structured information from emergency dialogues by computing F1 scores at both the slot level and the frame level within the EUREKA schema. The second task focuses on intervention classification, assessing the system’s capacity to predict appropriate clinical actions by evaluating both overall accuracy and macro F1, which accounts for class imbalance across intervention types. The third task, multilingual generalization, evaluates the robustness of the system across diverse linguistic settings, including both zero-shot and few-shot performance in non-English languages. Together, these tasks provide a comprehensive framework for benchmarking multilingual clinical NLP systems in emergency scenarios. TriageX thus establishes a reproducible, linguistically diverse, and intervention-grounded evaluation protocol that reflects the demands of telemedicine triage.
In addition to defining benchmark tasks, we implemented a systematic evaluation strategy to assess model performance across parsing and classification. For semantic parsing, we computed slot-level F1 scores for each field in the EUREKA schema, providing a fine-grained view of the model’s ability to extract clinically relevant concepts. Frame-level accuracy was also evaluated, defined as the proportion of dialogue instances for which all required slots were correctly extracted. Intervention classification performance was assessed using both overall accuracy and macro-averaged F1 scores, ensuring a balanced evaluation across intervention categories despite class imbalance. These metrics were computed separately for each language to capture multilingual robustness and cross-linguistic variation. To evaluate zero-shot generalization, we excluded each target language from the training set and measured performance on held-out test data without fine-tuning. All metrics were computed using standard implementations from the scikit-learn library to ensure reproducibility and transparency.
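The metric computations can be sketched as follows, assuming frames are flattened to slot-name/value string pairs; the scikit-learn calls mirror the standard implementations mentioned above, while the slot F1 helper is our own illustrative reduction.

from sklearn.metrics import accuracy_score, f1_score


def slot_f1(gold_frames: list[dict], pred_frames: list[dict]) -> float:
    # Micro-averaged slot F1 over flat {slot_name: value} dicts with string values.
    tp = fp = fn = 0
    for gold, pred in zip(gold_frames, pred_frames):
        gold_items, pred_items = set(gold.items()), set(pred.items())
        tp += len(gold_items & pred_items)
        fp += len(pred_items - gold_items)
        fn += len(gold_items - pred_items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def frame_accuracy(gold_frames: list[dict], pred_frames: list[dict]) -> float:
    # Proportion of instances whose full frame matches exactly.
    return sum(g == p for g, p in zip(gold_frames, pred_frames)) / len(gold_frames)


def intervention_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    # Overall accuracy and macro-averaged F1 across intervention classes.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }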

7. Results and Performance Analysis

7.1. Use Case: Suspected Stroke in a Rural Telemedicine Setting

This subsection describes how TriagE-NLU analyzes and responds to a suspected stroke scenario in a rural telemedicine setting. It highlights how TriagE-NLU interprets speech, enriches it with patient metadata, parses relevant clinical indicators, and outputs a structured summary along with a recommended intervention.
A visual representation of the use case is shown in Figure 6.
A 62-year-old woman with a history of hypertension and prior stroke is reported by her daughter to be “talking strangely,” feeling dizzy, and unable to move her arm. This acute symptom onset occurred approximately five minutes ago. The clinician is not physically present onsite, and the telemedicine platform relies on TriagE-NLU to process this input.
The patient’s metadata (age: 62, pre-existing conditions: hypertension, stroke) and free-text dialogue are combined into a unified multilingual input string that includes the patient’s spoken description and structured clinical context. This input is preprocessed by appending patient age and comorbidities using delimiters, such as [SEP], to form a model-readable string. This format enables TriagE-NLU to jointly encode unstructured text and relevant background information, eliminating the need for separate input channels.
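Under the assumption of a simple concatenation format (the exact field order and wording are not specified in the text), the model-readable string for this case could be built as follows:

def build_model_input(dialogue: str, age: int, conditions: list[str]) -> str:
    # Join free-text dialogue with structured patient context using [SEP] delimiters.
    return f"{dialogue} [SEP] age: {age} [SEP] conditions: {', '.join(conditions)}"


text = "She is talking strangely, feels dizzy and can't move her arm. It started about five minutes ago."
print(build_model_input(text, age=62, conditions=["hypertension", "prior stroke"]))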
Upon receiving this input, the TriagE-NLU model simultaneously performs semantic parsing and classification. The parsing head identifies symptoms (e.g., slurred speech, unilateral arm weakness), onset time (“sudden”), and urgency (“high”), structuring them into a EUREKA frame. In parallel, the classification head maps the structured understanding to an appropriate medical action, in this case, “activate stroke protocol”.
The output returned to the clinician consists of both the structured semantic representation (“EUREKA FRAME OUTPUT: <extra_id_0>: a stroke”) and the recommended intervention (“RECOMMENDED INTERVENTION: activate stroke protocol”). This enables the clinician to escalate care immediately without having to interpret the patient’s message manually. The entire process, from dialogue input to clinical action, showcases how TriagE-NLU can support timely, life-saving decisions in underserved areas.
Upon receiving the system’s recommendation, the remote clinician immediately initiates the stroke protocol. This includes instructing the onsite staff to prepare for emergency transport and notifying the nearest medical center in advance. Because TriagE-NLU has already extracted and structured the critical indicators of stroke, such as symptom onset within 5 min, speech impairment, and unilateral limb weakness, the clinician can make a rapid, evidence-backed decision without needing to parse the patient’s report manually. This significantly reduces the time to intervention, which is crucial in stroke care where every minute counts.
TriagE-NLU’s Python 3.11.9 code acts as an interpreter between unstructured dialogue and life-saving clinical decision support. It removes ambiguity from natural language and converts it into standardized medical actions.

7.2. TriagE-NLU Evaluation

We evaluate TriagE-NLU using the TriageX benchmark described in Section 6. Our experiments assess model performance across three core tasks: semantic parsing into EUREKA frames, clinical intervention prediction, and multilingual generalization. We present results on simulated subsets of TriageX to evaluate generalization, robustness, and language transfer abilities.
We split the TriageX dataset into training (70%), development (15%), and test (15%) sets, ensuring stratification across languages, urgency levels, and symptom combinations.
To contextualize TriagE-NLU performance, we compare against strong baselines.
  • mT5-Base: Fine-tuned on TriageX using standard seq2seq loss for EUREKA parsing only.
  • mBART: A multilingual encoder/decoder model.
  • mBART + MLP: A multilingual encoder/decoder model with an MLP classification head for intervention prediction.
  • XLM-RoBERTa: A multilingual language model.
  • XLM-RoBERTa + BiLSTM: Token-level slot tagging using a BiLSTM-CRF on top of frozen XLM-R embeddings.
All models are implemented using Hugging Face Transformers and trained on a workstation with an Intel Core i9 processor, 64 GB of RAM, and a GeForce RTX 4070 Ti graphics card.
Table 2 presents the results obtained for the performance evaluation of all models.
TriagE-NLU achieves the highest scores across all available metrics, with 0.91 Slot F1, 0.89 Frame F1, 0.88 Macro F1 (Intervention), and 0.92 accuracy, demonstrating its balanced effectiveness in both structured semantic parsing and clinical prediction.
In contrast, mT5-Base, a seq2seq model, performs relatively well in parsing (0.84 Slot F1) but lacks support for intervention prediction (hence no Macro F1 reported), and achieves 0.85 accuracy.
Models mBART and mBART + MLP achieve moderate performance in intervention classification, with Macro F1 scores of 0.81 and 0.83, respectively. However, they do not support structured semantic parsing.
XLM-RoBERTa, with and without BiLSTM, shows the weakest overall results. The BiLSTM-enhanced variant achieves 0.79 Slot F1 and 0.78 accuracy, indicating its limited ability to capture frame-level semantics or make precise classifications. The base XLM-R model performs slightly better in classification (Macro F1: 0.80, accuracy: 0.82) but still underperforms compared to TriagE-NLU.
In summary, Table 2 highlights TriagE-NLU’s comprehensive and superior performance, especially its unique ability to handle both semantic parsing and intervention prediction in a unified multilingual setting, capabilities that baseline models support only partially or with lower accuracy.
To provide a more intuitive visual comparison of the results obtained, we include Figure 7. In addition to Slot F1, Frame F1, Macro F1 (Intervention), and accuracy, we include the BLEU score, a sequence-level metric commonly used in machine translation to evaluate the quality of structured frame generation (EUREKA outputs).
Figure 7 enables a more precise observation of performance differences across models and metrics, highlighting TriagE-NLU’s consistent superiority over the baselines. While F1 and accuracy assess correctness in classification or slot-level predictions, BLEU measures how closely the generated sequences match ground-truth outputs in both content and word order. A higher BLEU score indicates better fluency and structural alignment, reflecting the model’s ability to create coherent and faithful clinical representations. TriagE-NLU outperformed all baseline models in both accuracy and BLEU score, achieving 0.92 accuracy and a BLEU score of 0.87, reflecting strong reliability in both classification and structured sequence generation tasks.
To further illustrate TriagE-NLU’s advantages, we analyzed challenging cases involving multilingual phrasing and overlapping symptoms. For example, in the Romanian input “soțul meu e amețit și nu-și poate mișca brațul drept” (“my husband is dizzy and cannot move his right arm”), mT5-Base fails to generate a structured frame, while XLM-R + BiLSTM misses the neurological context due to lack of metadata integration. In contrast, TriagE-NLU accurately extracts both neurological.dizziness and limb.weakness.right slots and recommends activating the stroke protocol. Additionally, mBART + MLP, which lacks a semantic head, cannot identify onset times or structured symptoms, limiting interpretability. These limitations stem from architectural constraints: mT5-Base lacks an explicit classification head, mBART + MLP is not designed for structured generation, and XLM-R + BiLSTM relies on frozen embeddings and cannot leverage patient metadata. TriagE-NLU’s dual-head architecture and integrated metadata encoding allow it to outperform these baselines in both accuracy and clinical interpretability across diverse, multilingual inputs.
These results highlight TriagE-NLU’s robustness across multilingual, multimodal input scenarios.

7.3. Multilingual Transfer and Zero-Shot Generalization

Figure 8 presents the results obtained for the TriagE-NLU performance evaluation regarding semantic parsing and intervention prediction by language.
The results illustrate the F1 scores of TriagE-NLU across five languages, English, Spanish, Romanian, Arabic, and Mandarin, for two key tasks: semantic parsing and intervention prediction.
  • English achieves the highest performance, with 0.92 F1 for semantic parsing and 0.89 for intervention prediction, indicating strong performance in high-resource language settings.
  • Spanish follows closely with 0.90 parsing F1 and 0.88 intervention F1, maintaining robust performance.
  • Romanian and Arabic, both lower-resource languages, show a modest drop: 0.89 and 0.88 parsing F1, and 0.87 and 0.86 intervention F1, respectively.
  • Mandarin exhibits the lowest scores in both categories: 0.87 parsing F1 and 0.84 intervention F1, suggesting room for improvement in very low-resource or complex linguistic settings.
In addition to F1 metrics, we examined BLEU scores for structured frame generation across languages. BLEU showed a similar downward trend: highest for English (0.89) and Spanish (0.87), slightly lower for Romanian (0.85) and Arabic (0.84), and lowest for Mandarin (0.81). This pattern reinforces the observation that the quality of structured output degrades modestly in low-resource or morphologically complex languages. However, the decline remains within an acceptable range for clinical use.
To deepen the analysis, we investigated language-specific strengths and weaknesses. For instance, in English and Spanish, TriagE-NLU accurately captured both onset time and symptom structure even in cases with nested clauses, such as the following:
“My husband began slurring his speech about 15 min ago after taking his meds.”
In contrast, in Mandarin, parsing errors occurred more frequently for temporally ambiguous expressions, e.g., “头晕已经有一会儿了” (“has been dizzy for a while”), which led to reduced frame completeness.
Romanian and Arabic inputs occasionally produced partial frames when metadata was implicitly stated (e.g., age or gender inferred from pronouns), highlighting the need for more fine-grained anaphora resolution.
These observations highlight limitations in both the training data distribution and the model’s capacity to interpret implicit context in morphologically rich or syntactically complex languages.
The TriageX dataset, although multilingual, still exhibits a resource imbalance favoring English and Spanish, which may bias model performance. Furthermore, language-specific annotation inconsistencies (e.g., synonym choices or phrasing variations) may affect label alignment across languages.
Despite these challenges, the relatively small F1 gap between high- and low-resource languages indicates that TriagE-NLU maintains strong cross-lingual generalization, with performance degrading gracefully as language resources decrease. This narrow gap supports the model’s multilingual robustness and its clinical utility in diverse global contexts.

7.4. Ablation Studies

Ablation studies are a standard methodological tool in machine learning research for assessing the contribution of individual components within a model architecture. By systematically removing or disabling specific modules or inputs and measuring the resulting performance degradation, researchers can determine whether each architectural feature is essential, redundant, or synergistic; this validates design decisions, guards against improvements driven by incidental correlations, and enhances the interpretability of complex systems. In clinical NLP, ablation studies have been used to evaluate the necessity of particular data sources (e.g., synthetic data), whose removal consistently degrades performance, highlighting the impact of each component on model robustness [33].
Following this approach, we systematically ablate the TriagE-NLU architecture to assess the functional contribution of each of its five core modules. The goal is to determine whether the combination of patient metadata, semantic parsing, classification, knowledge constraints, and federated learning is essential for robust multilingual clinical understanding and intervention planning.
We define the full model, which serves as the baseline configuration, as TriagE-NLU with all major components active: patient metadata integration, the semantic parsing head (EUREKA slot-filling), the intervention classification head, knowledge encoding and constraint enforcement, and the federated update loop.
We then evaluate the system under five ablation conditions, each designed to isolate the impact of one architectural component. In the “no metadata condition”, age and comorbidity inputs are removed from the input sequence. The “no semantic parsing head” variant disables structured EUREKA slot-filling, preserving only intervention classification. In contrast, the “no classification head” setup omits intervention prediction entirely, allowing the model to perform only semantic parsing. The “no knowledge constraints” configuration removes ontology-based validation and constraint enforcement in the outputs. Finally, the “no federated learning” condition replaces personalized, distributed updates with a centralized training strategy. Each variant is trained under identical settings and evaluated using the same test set.
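To make the experimental grid explicit, the following sketch enumerates the six configurations as simple feature flags; the flag names are hypothetical and only mirror the textual description above.

```python
# Hypothetical feature flags for the full model and the five ablation variants;
# names are illustrative and mirror the conditions described in the text.
ABLATION_CONFIGS = {
    "full_model":               dict(metadata=True,  parsing_head=True,  classification_head=True,  knowledge_constraints=True,  federated=True),
    "no_metadata":              dict(metadata=False, parsing_head=True,  classification_head=True,  knowledge_constraints=True,  federated=True),
    "no_semantic_parsing_head": dict(metadata=True,  parsing_head=False, classification_head=True,  knowledge_constraints=True,  federated=True),
    "no_classification_head":   dict(metadata=True,  parsing_head=True,  classification_head=False, knowledge_constraints=True,  federated=True),
    "no_knowledge_constraints": dict(metadata=True,  parsing_head=True,  classification_head=True,  knowledge_constraints=False, federated=True),
    "no_federated_learning":    dict(metadata=True,  parsing_head=True,  classification_head=True,  knowledge_constraints=True,  federated=False),
}
```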
We present Slot F1, Frame F1, Macro F1 (intervention), and accuracy (intervention) to measure the impact of each component removed. Slot F1 measures the F1 score for individual semantic slots extracted by the model during the parsing process. Frame F1 measures the F1 score for the entire structured frame (e.g., the EUREKA frame). Macro F1 (intervention) measures the unweighted average of F1 scores across all intervention classes (e.g., “activate stroke protocol”, “monitor only”). Accuracy (intervention) measures the percentage of intervention predictions that exactly match the ground truth. After discussing each ablation study, a comparison of the obtained scores will be provided and analyzed.
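As a concrete illustration of how these four metrics can be computed, the sketch below uses set-valued slot predictions and scikit-learn for the intervention metrics; all example values and label names are illustrative, and the exact frame-matching convention may differ from the one used in our evaluation scripts.

```python
from sklearn.metrics import f1_score, accuracy_score

def slot_f1(gold: set[str], pred: set[str]) -> float:
    """F1 over individual semantic slots for one example."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative gold vs. predicted slots for a single parsed utterance
gold_slots = {"neurological.dizziness", "limb.weakness.right", "onset.time"}
pred_slots = {"neurological.dizziness", "limb.weakness.right"}
print("Slot F1:", round(slot_f1(gold_slots, pred_slots), 2))

# Frame-level scoring gives credit per frame (here: exact match of the full slot set)
frame_correct = gold_slots == pred_slots

# Intervention metrics over a small batch of predictions
y_true = ["activate_stroke_protocol", "monitor_only", "activate_stroke_protocol"]
y_pred = ["activate_stroke_protocol", "monitor_only", "monitor_only"]
print("Macro F1 (intervention):", round(f1_score(y_true, y_pred, average="macro"), 2))
print("Accuracy (intervention):", round(accuracy_score(y_true, y_pred), 2))
```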
To ensure a fair comparison across all ablation variants and the full model, we maintained consistent training configurations for each experiment. All models were fine-tuned using the google/mT5-Base backbone with a maximum sequence length of 256 tokens, a batch size of eight, and a learning rate of 3 × 10⁻⁵. We used the AdamW optimizer with linear learning rate scheduling and warm-up steps set to 10% of the total training steps. Each model was trained for three epochs on the same training and development splits of the TriageX dataset. We employed early stopping based on validation loss with a patience of one epoch to prevent overfitting. The experiments are detailed below.
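This configuration maps onto the Hugging Face Trainer API roughly as follows; the sketch uses a toy dataset, argument names may vary slightly across Transformers versions, and AdamW with a linear schedule is the Trainer’s default optimizer setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, EarlyStoppingCallback,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def tokenize(batch):
    # Both the dialogue input and the serialized EUREKA frame are capped at 256 tokens
    enc = tokenizer(batch["input_text"], max_length=256, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["target_frame"],
                              max_length=256, truncation=True)["input_ids"]
    return enc

# Toy stand-in for the TriageX train/dev splits (illustrative content only)
toy = Dataset.from_dict({
    "input_text": ["en: She's dizzy and can't feel her arm. [SEP] 75 [SEP] prior stroke, diabetes"],
    "target_frame": ['{"symptoms": ["neurological.dizziness", "limb.weakness.right"], "urgency": "high"}'],
}).map(tokenize, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="triage-nlu-ablation",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                    # warm-up = 10% of total training steps
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=toy,                   # replace with the TriageX train split
    eval_dataset=toy,                    # replace with the TriageX dev split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
# trainer.train()
```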
Full model (TriagE-NLU).
The full model achieves a Slot F1 of 0.91, a Frame F1 of 0.89, a Macro F1 (intervention) of 0.88, and an accuracy of 0.92.
Ablation Study 1. Without Patient Metadata.
We exclude structured patient background information, specifically age and pre-existing conditions, from the input sequence. This configuration tests the model’s ability to perform semantic parsing and intervention classification based solely on the raw clinical dialogue, without the contextual enrichment provided by patient demographics and comorbidities. In the baseline setup, metadata is appended using special delimiters (e.g., [SEP]) to help the model jointly encode both free-text utterances and structured medical context. By removing this metadata, we assess the extent to which such auxiliary information contributes to the accuracy and precision of symptom extraction and intervention decisions. This ablation is significant in evaluating performance when metadata may be unavailable or incomplete.
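The following minimal sketch shows how this delimiter-based input could be assembled and how the ablation simply drops the metadata segment; the helper function follows the field layout described in Appendix A but is otherwise illustrative.

```python
def build_model_input(dialogue: str, lang: str,
                      age: int | None = None,
                      conditions: list[str] | None = None,
                      include_metadata: bool = True) -> str:
    """Assemble '<lang>: <dialogue> [SEP] <age> [SEP] <conditions>' (illustrative helper)."""
    text = f"{lang}: {dialogue}"
    if include_metadata:
        if age is not None:
            text += f" [SEP] {age}"
        if conditions:
            text += f" [SEP] {', '.join(conditions)}"
    return text

utterance = "She's dizzy and can't feel her arm."
full_input = build_model_input(utterance, "en", age=75, conditions=["prior stroke", "diabetes"])
ablated_input = build_model_input(utterance, "en", age=75, conditions=["prior stroke", "diabetes"],
                                  include_metadata=False)
# full_input    -> "en: She's dizzy and can't feel her arm. [SEP] 75 [SEP] prior stroke, diabetes"
# ablated_input -> "en: She's dizzy and can't feel her arm."
```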
In this experiment, the model achieved a Slot F1 score of 0.82, down from 0.91 in the full model. The Frame F1 score dropped from 0.89 to 0.78, indicating a lower ability to assemble consistent structured representations. On the intervention prediction task, the model recorded a Macro F1 (intervention) of 0.75 and accuracy of 0.81, reflecting reduced reliability in classifying appropriate actions due to missing patient context.
A representative failure case in this experiment involves the input “She’s dizzy and can’t feel her arm.” Without metadata, the model fails to prioritize stroke as the primary diagnosis and instead recommends general observation or a low-urgency triage category. In the full model, by contrast, with metadata indicating that the patient is 75 years old with a history of stroke and diabetes, TriagE-NLU correctly identifies this as a high-risk stroke scenario and recommends activating the stroke protocol immediately.
This illustrates how the absence of metadata could lead to an underestimation of urgency and incorrect intervention. The integration of patient metadata is essential to the architecture as it provides vital contextual cues that significantly enhance the model’s ability to accurately interpret symptoms, construct consistent clinical frames, and recommend appropriate interventions, capabilities that diminish notably when this component is removed.
Ablation Study 2. Without Semantic Parsing Head.
In this configuration, the semantic parsing head responsible for extracting structured EUREKA frames is disabled, and the model only performs intervention classification based on unstructured input. This design eliminates the ability to explicitly capture fine-grained clinical details, such as symptom types, onset times, or risk indicators. As a result, the model operates as a black-box classifier, lacking intermediate symbolic reasoning.
Performance evaluation revealed a noticeable decline in interpretability, although the Macro F1 (intervention) remained relatively stable at 0.87, with an accuracy of 0.90. Slot F1 and Frame F1 could not be computed because no structured outputs were produced.
This highlights the crucial role of the parsing head not only for downstream explainability but also for enabling fine-grained medical documentation, rule-based post-processing, and clinical auditability. The ablation confirms that semantic parsing, while not always strictly necessary for final classification accuracy, is indispensable for producing interpretable and trustworthy outputs in high-stakes clinical settings.
For instance, in a case where the input described a 70-year-old man experiencing dizziness, slurred speech, and right arm weakness (classic signs of stroke), the ablated model still predicted “activate stroke protocol” as the correct intervention. However, without the semantic parsing head, it failed to extract and represent these symptoms explicitly in the EUREKA frame. This lack of structured reasoning meant the system could not show why the decision was made, undermining interpretability. In a clinical setting, this would hinder a clinician’s ability to verify or override the system’s recommendation and make it impossible to log structured evidence for audit or downstream integration.
Ablation Study 3. Without Intervention Classification Head.
In this experiment, we removed the intervention classification head, allowing the model to perform only semantic parsing without producing a direct recommendation. While the EUREKA frame remained intact, accurately capturing symptoms, onset time, and urgency, the absence of an explicit intervention prediction disrupted the system’s ability to complete the decision-making loop.
Performance metrics reflect this gap: the model retained a high Slot F1 score of 0.90 and a Frame F1 of 0.88, indicating that the extraction of structured medical information was largely unaffected. However, since no intervention output was available, both the Macro F1 (intervention) and accuracy (intervention) scores were undefined. This ablation demonstrates that while semantic understanding can be preserved, the classification head is essential for actionable triage. Without it, clinicians would need to manually interpret the structured outputs, introducing delay and variability in emergency response. The classification head thus plays a pivotal role in transforming understanding into timely and standardized clinical decisions.
In a stroke-related example, a 66-year-old woman presented with slurred speech, facial drooping, and sudden weakness in her right arm. The system, even without the intervention classification head, correctly extracted and structured these symptoms into the EUREKA frame, identifying the urgency as “high.” However, without the classification module, the system failed to map this structured understanding to the correct medical action, namely, “activate stroke protocol.” Consequently, no clear directive was delivered to the remote clinician. While the semantic structure was accurate, the lack of an explicit intervention recommendation delayed the clinical response in a time-critical scenario where minutes count.
Ablation Study 4. Without Knowledge Encoding Module.
In this case, we removed the ontology-based knowledge encoding and constraint enforcement mechanisms from the system. These components normally act as structural filters that validate the semantic and classification outputs against domain knowledge, ensuring that the model’s predictions remain medically consistent and clinically plausible. Without these constraints, the system had access to the same training data and model parameters but was permitted to produce unconstrained outputs. As a result, we observed an increase in noisy or illogical predictions. For example, in one case, the model assigned the symptom “acute chest pain” a low urgency level and incorrectly linked it to a non-emergency action such as “recommend rest,” despite the presence of metadata indicating cardiovascular risk. This type of mismatch would not have passed the validation layer in the full model.
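A minimal sketch of the kind of post hoc consistency rule this module enforces is shown below; the symptom names, urgency levels, and rules are illustrative stand-ins for the system’s actual ontology and constraint set.

```python
# Illustrative consistency rules; not the system's actual knowledge base.
HIGH_RISK_SYMPTOMS = {"acute_chest_pain", "limb.weakness.right", "speech.slurred"}
NON_EMERGENCY_ACTIONS = {"recommend_rest", "monitor_only"}

def violates_constraints(frame: dict, intervention: str) -> bool:
    symptoms = set(frame.get("symptoms", []))
    urgency = frame.get("urgency", "low")
    if symptoms & HIGH_RISK_SYMPTOMS and urgency == "low":
        return True   # a high-risk symptom may not be labelled low urgency
    if "acute_chest_pain" in symptoms and intervention in NON_EMERGENCY_ACTIONS:
        return True   # a cardiac red flag may not map to a non-emergency action
    return False

frame = {"symptoms": ["acute_chest_pain"], "urgency": "low"}
print(violates_constraints(frame, "recommend_rest"))  # True: this output would be rejected or revised
```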
Performance-wise, the model without knowledge constraints yielded a Slot F1 of 0.86 and a Frame F1 of 0.80, both lower than the full model’s scores of 0.91 and 0.89, respectively. Intervention prediction was also affected, with a Macro F1 of 0.77 and an accuracy of 0.83, compared to 0.88 and 0.92 in the complete system. These results highlight how domain constraints improve both structural fidelity and decision-making reliability, preventing the generation of medically inappropriate outputs.
Ablation Study 5. Without Federated Update Loop.
In this final ablation, we disabled the federated update mechanism and trained the model in a centralized fashion using aggregated data. This approach removes local model personalization and ignores the distributional differences across institutions or regions that are naturally captured in federated learning settings. Without federated fine-tuning, the model exhibited reduced adaptability to local linguistic or clinical nuances, especially in multilingual or region-specific cases. For instance, in a Romanian-language input describing a stroke scenario with subtle phrasing (“nu poate vorbi și cade într-o parte”, meaning that the person is unable to speak and collapses to one side), the centrally trained model failed to interpret the symptom structure correctly and assigned an urgency level of “moderate.” In contrast, the federated version had learned to recognize this formulation from regional data and responded with the correct “activate stroke protocol” intervention.
Quantitatively, the non-federated model achieved a Slot F1 of 0.88 and a Frame F1 of 0.83, compared to 0.91 and 0.89 in the full model. The Macro F1 for intervention classification dropped to 0.79 (from 0.88), and accuracy fell to 0.86 (from 0.92). These drops demonstrate the contribution of federated learning in improving generalizability while preserving local relevance, especially in geographically or linguistically diverse deployments.
Summary of Ablation Results.
The ablation studies demonstrate the necessity of each architectural module in achieving TriagE-NLU’s full performance. Removing patient metadata significantly impairs semantic understanding, highlighting the value of contextual enrichment. Disabling the semantic parsing head or classification head individually reveals how each contributes distinct yet complementary capabilities, interpretability, and clinical decision mapping. The knowledge constraint module helps reduce noisy outputs, while federated updates boost performance in multilingual and heterogeneous deployment settings. Together, these findings confirm that TriagE-NLU’s architecture relies on the synergistic contribution of its components to enable accurate, structured, and context-aware emergency triage.
To visually summarize these findings, in Figure 9 we present a heatmap showing the performance of each model configuration across four key metrics: Slot F1, Frame F1, Macro F1 (intervention), and accuracy (intervention). The heatmap makes clear that the full model consistently outperforms its ablated counterparts, underscoring the cumulative importance of the integrated components.

8. Conclusions and Future Work

This paper introduced TriagE-NLU, a novel, multilingual clinical NLP system designed for interpretation and intervention support in telemedicine contexts. Unlike prior approaches that specialize in either semantic parsing or classification, TriagE-NLU unifies both capabilities under a federated, low-resource-optimized architecture. Evaluated across multilingual emergency scenarios, the model demonstrated state-of-the-art performance, achieving 0.91 F1 in semantic parsing, 0.89 F1 in intervention classification, 0.92 overall accuracy, and a BLEU score of 0.87, substantially outperforming established baselines like mT5-Base, mBART + MLP, and XLM-RoBERTa + BiLSTM.
Our findings highlight TriagE-NLU’s ability to generalize across languages and conditions, offering actionable medical understanding even in low-connectivity, multilingual rural environments. The architecture’s integration of structured symptom reasoning and decision grounding, combined with federated learning for privacy-preserving adaptation, marks a significant contribution to clinical NLP and digital health infrastructure. To our knowledge, this is the first system that jointly addresses multilingual clinical semantic parsing and intervention recommendation in a federated triage setting.
These results suggest that TriagE-NLU is not only a technical advancement but also a candidate for future deployment, with high potential clinical utility in time-sensitive emergency workflows.
In future work, we plan to pursue several concrete directions to strengthen TriagE-NLU’s capabilities further. First, we will extend the system beyond initial triage to support broader diagnostic reasoning, including differential diagnosis and follow-up care suggestions. Second, we aim to incorporate wearable sensor data (e.g., vitals, motion tracking, fall detection) into the NLU pipeline, enabling multimodal triage interpretation that combines natural language with physiological signals. Third, we are working to expand the TriageX dataset with more clinically diverse cases, additional low-resource languages, and transcripts to improve robustness and coverage.
In parallel, we intend to initiate collaborations with healthcare providers, EMS agencies, and policy stakeholders to pilot the integration of TriagE-NLU into emergency response workflows. Such partnerships will be critical for validating system impact in practice, refining usability under operational constraints, and aligning the model with regulatory and clinical governance standards.
In conclusion, we consider TriagE-NLU a promising step toward enhancing digitally supported emergency care, with the potential to improve access to reliable triage assistance across diverse linguistic, geographic, and resource-constrained settings.

Author Contributions

Conceptualization, B.-M.B. and I.S.; methodology, B.-M.B.; software, B.-M.B.; validation, B.-M.B., I.S. and A.-C.P.-N.; investigation, B.-M.B.; data curation, B.-M.B.; writing—original draft preparation, B.-M.B.; visualization, B.-M.B.; supervision, I.S. and A.-C.P.-N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The TriageX dataset is available on request from the authors; further inquiries can also be directed to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. TriagE-NLU Implementation

The TriagE-NLU prototype is implemented in Python 3.11.9 using the Hugging Face Transformers library for multilingual semantic parsing and Flower for federated learning simulation. The architecture consists of two main components: a multilingual semantic parser based on mT5 and a secondary classification head for medical intervention prediction. To implement the proposed TriagE-NLU, the following steps were followed:
  • Preprocessing and Data Handling involve several steps. First, multilingual inputs from the TriageX dataset are loaded and tokenized. Patient metadata is then normalized and merged with the corresponding input text to provide contextual grounding. Finally, the target EUREKA frames are converted into linearized string formats suitable for sequence generation tasks.
  • Semantic Parsing (mT5) is performed by fine-tuning the google/mT5-Base model in a sequence-to-sequence setup. The input follows the format: “<lang>: <dialogue> [SEP] <age> [SEP] <conditions>”, and the output corresponds to a serialized EUREKA frame represented as a JSON-like string. To ensure structural validity, constrained decoding rules are applied during generation to enforce correct EUREKA slot formatting.
  • Intervention Classification Head is implemented by attaching an MLP classifier either on top of the encoder output or the parsed EUREKA frame. This classifier is trained using a multitask learning approach, either jointly with the semantic parser or in a sequential setup, depending on the configuration.
  • Federated Learning Simulation uses Flower to simulate multiple clients, each training locally on its own data partition and contributing updates to a shared global model (a minimal client sketch follows this list).
  • The Evaluation process includes computing frame-level and slot-level F1 scores for the EUREKA semantic parsing task, as well as measuring accuracy and Macro F1 for intervention classification. Additionally, performance is analyzed separately for each language to assess cross-lingual generalization.
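For reference, a minimal sketch of such a simulation is shown below, assuming the flwr NumPyClient/start_simulation API (flwr[simulation]) and a FedAvg aggregation strategy; FedAvg is an assumption rather than a detail fixed above, and the local mT5 training step is stubbed out.

```python
import flwr as fl
import numpy as np

class TriageClient(fl.client.NumPyClient):
    """One simulated client, e.g., a language- or region-specific TriageX partition."""

    def __init__(self, cid: str):
        self.cid = cid
        self.weights = [np.zeros(10, dtype=np.float32)]  # placeholder for real model parameters

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        self.weights = parameters           # load the global weights...
        # ...local mT5 fine-tuning on this client's data partition would go here...
        return self.weights, 100, {}        # updated weights, local example count, metrics

    def evaluate(self, parameters, config):
        return 0.0, 100, {"accuracy": 0.0}  # loss, example count, metrics (stubbed)

def client_fn(cid: str):
    # Older Flower versions accept returning the NumPyClient directly
    return TriageClient(cid).to_client()

if __name__ == "__main__":
    fl.simulation.start_simulation(
        client_fn=client_fn,
        num_clients=5,                                   # e.g., one simulated client per language
        config=fl.server.ServerConfig(num_rounds=3),
        strategy=fl.server.strategy.FedAvg(),
    )
```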

References

  1. Jerfy, A.; Selden, O.; Balkrishnan, R. The Growing Impact of Natural Language Processing in Healthcare and Public Health. INQUIRY 2024, 61, 00469580241290095. [Google Scholar] [CrossRef]
  2. Biswas, P.; Arockiam, D. Review on Multi-lingual Sentiment Analysis in Health Care. J. Electr. Syst. 2024, 20, 3394–3405. [Google Scholar] [CrossRef]
  3. Supriyono, A.P.W.; Suyono, F.K. Advancements in Natural Language Processing: Implications, Challenges, and Future Directions. Telemat. Inform. Rep. 2024, 16, 100173. [Google Scholar] [CrossRef]
  4. Basra, J.; Saravanan, R.; Rahul, A. A survey on Named Entity Recognition—datasets, tools, and methodologies. Nat. Lang. Process. J. 2023, 3, 100017. [Google Scholar] [CrossRef]
  5. Mandale-Jadhav, A. Text Summarization Using Natural Language Processing. J. Electr. Syst. 2025, 20, 3410–3417. [Google Scholar] [CrossRef]
  6. Upadhyay, P.; Agarwal, R.; Dhiman, S.; Sarkar, A.; Chaturvedi, S. A comprehensive survey on answer generation methods using NLP. Nat. Lang. Process. J. 2024, 8, 100088. [Google Scholar] [CrossRef]
  7. Sun, H.; Peng, J.; Yang, W.; He, L.; Du, B.; Yan, R. Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment. arXiv 2025, arXiv:2506.10877. [Google Scholar] [CrossRef]
  8. Li, X.; Peng, L.; Wang, Y.P.; Zhang, W. Open challenges and opportunities in federated foundation models towards biomedical healthcare. BioData Min. 2025, 18, 2. [Google Scholar] [CrossRef]
  9. Li, I.; Rao, S.; Solares, J.R.A.; Hassaine, A.; Ramakrishnan, R.; Canoy, D.; Zhu, Y.; Rahimi, K.; Salimi-Khorshidi, G. BEHRT: Transformer for Electronic Health Records. Sci. Rep. 2020, 10, 7155. [Google Scholar] [CrossRef]
  10. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.-H.; Jindi, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 72–78. [Google Scholar] [CrossRef]
  11. Kocaballi, A.B.; Berkovsky, S.; Quiroz, J.C.; Laranjo, L.; Tong, H.L.; Rezazadegan, D.; Briatore, A.; Coiera, E. The Personalization of Conversational Agents in Health Care: Systematic Review. J. Med. Internet Res. 2019, 21, e15360. [Google Scholar] [CrossRef]
  12. Milne-Ives, M.; de Cock, C.; Lim, E.; Shehadeh, M.H.; de Pennington, N.; Mole, G.; Normando, E.; Meinert, E. The effectiveness of artificial intelligence conversational agents in healthcare. J. Med. Internet Res. 2020, 22, e20346. [Google Scholar] [CrossRef] [PubMed]
  13. Zeng, G.; Yang, W.; Ju, Z.; Yang, Y.; Wang, S.; Zhang, R.; Zhou, M.; Zeng, J.; Dong, X.; Zhang, R.; et al. MedDialog: Large-scale Medical Dialogue Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar] [CrossRef]
  14. Zang, T.; Cai, Z.; Wang, C.; Qiu, M.; Yang, B.; He, X. SMedBERT: A Knowledge-Enhanced Dialogue Generation Model for Medical Consultation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021. [Google Scholar] [CrossRef]
  15. Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An overview of clinical decision support systems: Benefits, risks, and strategies for success. NPJ Digit. Med. 2020, 3, 17. [Google Scholar] [CrossRef] [PubMed]
  16. Wu, Z.; Dadu, A.; Nalls, M.; Faghri, F.; Sun, J. Instruction Tuning Large Language Models to Understand Electronic Health Records. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Available online: https://proceedings.neurips.cc/paper_files/paper/2024/file/62986e0a78780fe5f17b495aeded5bab-Paper-Datasets_and_Benchmarks_Track.pdf (accessed on 24 June 2025).
  17. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
  18. Sheller, M.J.; Edwards, B.; Reina, G.A.; Martin, J.; Pati, S.; Kotrotsou, A.; Milchenko, M.; Xu, W.; Marcus, D.; Colen, R.R.; et al. Federated learning in medicine: Facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 2020, 10, 12598. [Google Scholar] [CrossRef]
  19. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021. [Google Scholar] [CrossRef]
  20. Tang, Y.; Tran, C.; Li, X.; Chen, P.-J.; Goyal, N.; Chaudhary, V.; Gu, J.; Fan, A. Multilingual Translation from Denoising Pre-Training. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 3450–3466. Available online: https://aclanthology.org/2021.findings-acl.304/ (accessed on 24 June 2025).
  21. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzman, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar] [CrossRef]
  22. Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Flores, M.G.; Zhang, Y.; et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv 2022, arXiv:2203.03540. [Google Scholar] [CrossRef]
  23. Liu, W.; Tang, J.; Cheng, Y.; Li, W.; Zheng, Y.; Liang, X. MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation. arXiv 2022, arXiv:2010.07497. [Google Scholar] [CrossRef]
  24. Liu, T.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. arXiv 2020, arXiv:2001.08210. [Google Scholar] [CrossRef]
  25. Kusumawardani, R.P.; Kusumawati, K.N. Named entity recognition in the medical domain for Indonesian language health consultation services using bidirectional-lstm-crf algorithm. Procedia Comput. Sci. 2024, 245, 1146–1156. [Google Scholar] [CrossRef]
  26. Rojas-Carabali, W.; Agrawal, R.; Gutierrez-Sinisterra, L.; Baxter, S.L.; Cifuentes-González, C.; Wei, Y.C.; Abisheganaden, J.; Kannapiran, P.; Wong, S.; Lee, B.; et al. Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician. Asia Pac. J. Ophthalmol. 2024, 13, 100084. [Google Scholar] [CrossRef]
  27. Benjamin, C.; Henry, I.; Bergman, S.S.; Gabriele, P.; Joseph, S.; Alun, H.D. Natural language processing techniques applied to the electronic health record in clinical research and practice-an introduction to methodologies. Comput. Biol. Med. 2025, 188, 09808. [Google Scholar] [CrossRef]
  28. Thatoi, P.; Choudhary, R.; Shiwlani, A.; Qureshi, H.A.; Kuma, S. Natural Language Processing (NLP) in the Extraction of Clinical Information from Electronic Health Records (EHRs) for Cancer Prognosis. Int. J. Membr. Sci. Technol. 2023, 10, 2676–2694. [Google Scholar]
  29. Sezgin, E.; Hussain, S.A.; Rust, S.; Huang, Y. Extracting Medical Information from Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study with Real-world Data. JMIR Form. Res. 2023, 7, e43014. [Google Scholar] [CrossRef]
  30. Yang, C.; Deng, J.; Chen, X.; An, Y. SPBERE: Boosting span-based pipeline biomedical entity and relation extraction via entity information. J. Biomed. Inform. 2023, 145, 104456. [Google Scholar] [CrossRef]
  31. Haleem, A.; Javaid, M.; Singh, R.P.; Suman, R. Telemedicine for healthcare: Capabilities, features, barriers, and applications. Sens. Int. 2021, 2, 100117. [Google Scholar] [CrossRef]
  32. Mermin-Bunnell, K.; Zhu, Y.; Hornback, A.; Damhorst, G.; Walker, T.; Robichaux, C.; Mathew, L.; Jaquemet, N.; Peters, K.; Johnson, T.M.; et al. Use of Natural Language Processing of Patient-Initiated Electronic Health Record Messages to Identify Patients With COVID-19 Infection. JAMA Netw. Open. 2023, 6, e2322299. [Google Scholar] [CrossRef]
  33. Guevara, M.; Chen, S.; Thomas, S.; Chaunzwa, T.L.; Franco, I.; Kann, B.H.; Moningi, S.; Qian, J.M.; Goldstein, M.; Harper, S.; et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit. Med. 2024, 7, 6. [Google Scholar] [CrossRef] [PubMed]
Figure 1. System-level overview of the proposed TriagE-NLU.
Figure 2. Pipeline overview of the proposed TriagE-NLU.
Figure 3. Eureka scheme.
Figure 4. Proposed training methodology.
Figure 5. TriageX examples in (a) English; (b) Spanish; (c) Arabic; (d) Mandarin; (e) Romanian.
Figure 6. Use-case scenario for TriagE-NLU.
Figure 7. Semantic parsing, intervention prediction, accuracy and BLEU comparison results between TriagE-NLU and baseline models.
Figure 8. TriagE-NLU evaluation: performance by language.
Figure 9. Heatmap showing performance impact of ablation studies on TriagE-NLU across key evaluation metrics, highlighting the contribution of each architectural component.
Table 1. Novelty of TriagE-NLU compared to prior work in clinical NLP.

| Feature/Capability | Prior Work in Clinical NLP | TriagE-NLU |
|---|---|---|
| Dialogue Understanding | Mostly retrospective text [26,27] | Supports multi-turn emergency dialogue |
| Patient-Aware Semantic Parsing | Typically text-only models [9,10] | Integrates age, vitals, onset metadata |
| Multilingual and Low-Resource Generalization | Monolingual focus [13,21] | Designed for multilingual settings via transfer learning |
| Grounding in Medical Interventions | Most focus on diagnosis or symptom extraction, not intervention mapping [28,29] | Maps parsed intent to clinical actions |
| Federated Learning for Privacy Adaptation | Centralized learning [30] | Federated training with privacy-preserving aggregation |
| Emergency Telemedicine Focus | General clinical use or outpatient consults [31,32] | Tailored to high-stakes, urgent scenarios |
| Triage Benchmark with Interventions | No available datasets with multilingual, intervention-labeled emergency conversations | Introduces TriageX dataset |
Table 2. Comparative evaluation of TriagE-NLU and baseline models.

| Model | Slot F1 | Frame F1 | Macro F1 (Intervention) | Accuracy |
|---|---|---|---|---|
| TriagE-NLU | 0.91 | 0.89 | 0.88 | 0.92 |
| mT5-Base (Seq2Seq) | 0.84 | 0.85 | – | – |
| mBART | 0.81 | 0.83 | – | – |
| mBART + MLP | – | – | 0.83 | 0.81 |
| XLM-RoBERTa | – | – | 0.80 | 0.82 |
| XLM-RoBERTa + BiLSTM | 0.79 | 0.78 | – | – |
