Article

Balanced Knowledge Transfer in MTTL-ClinicalBERT: A Symmetrical Multi-Task Learning Framework for Clinical Text Classification

1 Department of Statistics and Biostatistics, California State University, East Bay, Hayward, CA 94542, USA
2 College of Engineering, Texas A&M University, College Station, TX 77840, USA
3 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 823; https://doi.org/10.3390/sym17060823
Submission received: 5 May 2025 / Revised: 21 May 2025 / Accepted: 22 May 2025 / Published: 25 May 2025
(This article belongs to the Section Computer)

Abstract: Clinical text classification presents significant challenges in healthcare informatics due to inherent asymmetries in domain-specific terminology, knowledge distribution across specialties, and imbalanced data availability. We introduce MTTL-ClinicalBERT, a symmetrical multi-task transfer learning framework that harmonizes knowledge sharing across diverse medical specialties while maintaining balanced performance. Our approach addresses the fundamental problem of symmetry in knowledge transfer through three innovative components: (1) an adaptive knowledge distillation mechanism that creates symmetrical information flow between related medical domains while preventing negative transfer; (2) a bidirectional hierarchical attention architecture that establishes symmetry between local terminology analysis and global contextual understanding; and (3) a dynamic task-weighting strategy that maintains equilibrium in the learning process across asymmetrically distributed medical specialties. Extensive experiments on the MTSamples dataset demonstrate that our symmetrical approach consistently outperforms asymmetric baselines, achieving average improvements of 7.2% in accuracy and 6.8% in F1-score across five major specialties. The framework’s knowledge transfer patterns reveal a symmetric similarity matrix between specialties, with strongest bidirectional connections between cardiovascular/pulmonary and surgical domains (similarity score 0.83). Our model demonstrates remarkable stability and balance in low-resource scenarios, maintaining over 85% classification accuracy with only 30% of training data. The proposed framework not only advances clinical text classification through its symmetrical design but also provides valuable insights into balanced information sharing between different medical domains, with broader implications for symmetrical knowledge transfer in multi-domain machine learning systems.

1. Introduction

Clinical text classification plays a pivotal role in modern healthcare systems, serving as a fundamental tool for extracting actionable insights from unstructured medical records, including discharge summaries, surgical reports, and clinical notes. The accurate classification of these documents is crucial for improving patient care, facilitating clinical decision-making, and advancing medical research [1,2]. These clinical documents contain vital information that can enhance diagnostic accuracy, treatment planning, and clinical workflow optimization. However, despite its significance, clinical text classification faces several intricate challenges that demand innovative solutions addressing the inherent symmetry imbalances in medical documentation across specialties.
The primary challenge lies in the inherent complexity and asymmetrical nature of medical documentation, where a single clinical record often encompasses information spanning multiple medical specialties without symmetrical representation. This multi-faceted nature of clinical texts, combined with the domain-specific terminology and complex medical jargon, creates significant obstacles for traditional classification approaches [3]. For instance, a post-operative report might contain detailed surgical procedures while simultaneously discussing cardiological considerations and neurological observations, making it difficult to establish clear categorical boundaries [4,5]. This complexity is further compounded by the variability in documentation styles across different healthcare institutions and medical practitioners, leading to inconsistencies in terminology usage and narrative structure. The scarcity of large-scale labeled clinical datasets presents another significant barrier. Healthcare data is subject to strict privacy regulations, such as HIPAA, which severely restricts data sharing and accessibility [6]. Moreover, the annotation of clinical texts requires extensive domain expertise, making the labeling process both time-consuming and costly [7]. This data limitation particularly impacts the development of deep learning models, which typically require substantial amounts of labeled data for effective training [8,9]. The challenge is exacerbated by the imbalanced nature of medical data, where certain conditions or specialties may be underrepresented, leading to biased model performance.
Traditional natural language processing approaches, while successful in general domain applications, often fall short when applied to clinical texts. Standard techniques such as bag-of-words or TF-IDF struggle to capture the nuanced semantics of medical terminology and the complex relationships between different medical concepts [10,11]. These methods typically treat words as independent tokens, failing to account for the hierarchical nature of medical knowledge and the contextual dependencies that are crucial for accurate interpretation. For example, the term “cold” could refer to a temperature sensation, an upper respiratory infection, or a chronic condition, depending on the context. Furthermore, while pre-trained language models have revolutionized NLP tasks, they face significant limitations in the clinical domain. These models, though powerful in general contexts, may fail to grasp domain-specific medical terminology and relationships [12,13,14]. Several attempts have been made to address these limitations through domain-specific pre-training [15,16], but these approaches often suffer from limited coverage of medical vocabulary and domain-specific abbreviations. They struggle with understanding temporal relationships in clinical narratives, demonstrate poor handling of long-range dependencies in extensive clinical documents, and show an inability to effectively transfer knowledge across different medical specialties. Clinical documentation demonstrates significant variation across medical specialties, with each field developing its own conventions, terminology preferences, and narrative structures. For instance, radiology reports often follow highly structured formats with standardized sections, while surgical notes tend to contain detailed procedural narratives with specialty-specific terminology. Neurological assessments frequently employ specialized scoring systems and observation frameworks not commonly used in other fields. These differences in documentation standards present significant challenges for traditional classification approaches that treat all clinical texts uniformly.
To address these challenges, we propose MTTL-ClinicalBERT, a novel multi-task transfer learning framework specifically designed for clinical text classification. Our approach introduces three key innovations that directly address the limitations of existing methods by establishing symmetrical knowledge pathways between specialties. First, we develop an adaptive knowledge distillation mechanism that facilitates the transfer of domain knowledge between related medical specialties. Unlike previous approaches that use uniform knowledge transfer [17], our mechanism selectively transfers knowledge based on specialty similarity, preventing negative transfer while maximizing beneficial knowledge sharing. Second, we implement a symmetrically balanced hierarchical attention architecture that simultaneously processes local clinical terminology and global contextual patterns. This architecture improves upon existing attention mechanisms by incorporating medical domain knowledge and maintaining equilibrium between both fine-grained medical terms and document-level semantic relationships. Third, we introduce a dynamic task-weighting strategy that optimizes the learning process across multiple medical classification objectives [18,19]. This approach adaptively adjusts the importance of different tasks based on their difficulty and inter-relationships, ensuring balanced learning across specialties.
Our MTTL-ClinicalBERT framework specifically addresses these cross-specialty variations through its hierarchical attention architecture, which learns to recognize specialty-specific documentation patterns while simultaneously identifying shared clinical concepts. Rather than developing separate models for each specialty—an approach that would ignore valuable cross-specialty information and require significantly more training data—we demonstrate that a unified multi-specialty model offers several key advantages: (1) improved performance through shared representation learning where specialties with similar documentation characteristics mutually reinforce each other; (2) more robust handling of cases that span multiple specialties, which are increasingly common in modern medicine; and (3) better generalization to low-resource specialties by leveraging knowledge from data-rich domains. Our adaptive knowledge distillation mechanism specifically addresses the risk of negative transfer between dissimilar specialties while maximizing beneficial knowledge sharing where appropriate. Specifically, our innovations address the limitations of previous work in several critical ways. Unlike existing clinical BERT variants that rely solely on masked language modeling [15], our model incorporates specialty-specific knowledge through adaptive distillation. In contrast to traditional attention mechanisms that treat all words equally, our hierarchical attention specifically focuses on medical terminology and their relationships. While previous multi-task learning approaches use static task weights [20], our dynamic weighting strategy adapts to the learning progress of each specialty.
We validate our approach on the MTSamples [21] dataset, comprising 4998 medical transcriptions across various specialties. Our experimental results demonstrate that MTTL-ClinicalBERT significantly outperforms existing state-of-the-art models, achieving substantial improvements in both accuracy and F1-score. The model shows an average improvement of 7.2% in classification accuracy across all specialties and a 6.8% increase in macro F1-score compared to the best baseline model. Notably, our model maintains robust performance even in low-resource scenarios, demonstrating effective knowledge transfer and generalization capabilities [22,23]. Specifically, it maintains over 85% accuracy with only 30% of training data, suggesting that our approach successfully addresses the key challenges in clinical text classification while providing a foundation for future developments in medical natural language processing.
The remainder of this paper is organized as follows. Section 2 reviews related work in clinical text classification, transfer learning, and multi-task transfer learning in the medical domain, providing a comprehensive overview of existing approaches and their limitations. Section 3 describes our proposed MTTL-ClinicalBERT framework in detail, including thorough mathematical formulations and architectural specifications. Section 4 presents our experimental setup and results, with detailed analyses of model performance across different scenarios. Finally, Section 5 concludes the paper and discusses future research directions, including potential extensions and applications of our framework.

2. Related Work

2.1. Clinical Text Classification

Early approaches to clinical text classification relied heavily on traditional machine learning models such as Support Vector Machines (SVMs) and Naive Bayes, which operated on manually engineered features including bag-of-words and term frequency-inverse document frequency (TF-IDF) representations [24,25]. While these methods provided a foundation for automated clinical text analysis, their dependence on handcrafted features limited their ability to capture the inherent complexity of medical terminology and context [10]. The advent of deep learning brought significant advances through the application of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks [26,27]. These architectures demonstrated superior capability in processing sequential clinical data, though they faced challenges with long-range dependencies and computational efficiency. Concurrently, Convolutional Neural Networks (CNNs) showed promise in identifying local patterns within clinical documents but struggled with capturing broader contextual relationships [28,29]. The emergence of transformer-based architectures marked a pivotal advancement in clinical text classification. Models like BERT and RoBERTa achieved remarkable performance improvements through their self-attention mechanisms [30,31]. However, their token length limitations (typically 512 tokens) posed significant constraints for processing lengthy clinical documents. This limitation led to the development of specialized architectures like Clinical-Longformer, which employed sparse attention mechanisms to handle extended sequences [32,33]. Recent developments have focused on addressing domain-specific challenges in clinical text classification. Researchers have tackled data imbalance issues through techniques like Synthetic Minority Oversampling Technique (SMOTE) [34,35], though synthetic data generation often introduces noise that fails to capture the nuanced complexity of clinical cases. Domain-specific tools like SciSpacy have emerged for medical entity recognition [36], while multi-task learning approaches like MIMIC-Extract have demonstrated the potential for simultaneous optimization of multiple clinical objectives [9,37].

2.2. Transfer Learning in Clinical NLP

Transfer learning has revolutionized clinical natural language processing by enabling models to leverage knowledge from general domain texts and adapt it to specific clinical applications. The prominent success of BERT-based models, fine-tuned on clinical datasets like MIMIC-III, has established a new paradigm for clinical text understanding [9,30]. This approach has proven particularly valuable given the scarcity of high-quality annotated clinical data [37]. Domain-specific adaptations of transformer models, such as ClinicalBERT and BioBERT, have further advanced the field by incorporating specialized medical knowledge during pre-training [15,16]. These models exhibit superior performance in tasks like ICD coding and adverse event detection, though their specialization sometimes limits their generalizability [32]. Long-document processing challenges have been addressed through innovations like Clinical-Longformer [33], albeit with increased computational demands [34]. The field has also seen significant progress in cross-domain knowledge transfer, enabling models to adapt from general healthcare scenarios to specialized medical domains [17,37]. Recent innovations include zero-shot and few-shot learning approaches [38], which are particularly valuable in clinical settings where labeled data is scarce. Self-supervised learning techniques have emerged as promising solutions for leveraging unlabeled clinical data while maintaining patient privacy [36]. Current research directions focus on addressing remaining challenges, particularly in handling imbalanced datasets through cost-sensitive learning and dynamic weighting strategies [35,38]. These approaches aim to improve model performance on underrepresented medical conditions while maintaining robust general performance across diverse clinical scenarios.

2.3. Emerging Approaches in Clinical NLP

Recent years have witnessed the emergence of novel methodologies that address specific constraints in clinical NLP, particularly regarding data privacy and resource limitations. Federated learning has gained significant attention as a privacy-preserving approach that enables model training across distributed healthcare institutions without centralizing sensitive patient data [39,40]. This methodology allows multiple organizations to collaboratively train models while keeping patient records within their secure environments, addressing the critical privacy regulations such as HIPAA that often impede large-scale clinical NLP research. Early implementations in clinical text analysis demonstrate promising results [41], though challenges remain in handling the heterogeneity of clinical documentation across institutions and addressing the computational inefficiencies introduced by distributed optimization. Concurrently, self-supervised learning approaches have shown remarkable potential for leveraging the vast amounts of unlabeled clinical text [42,43]. By generating supervisory signals from the data itself, these methods significantly reduce the dependency on expensive manual annotations. Techniques such as clinical text reconstruction, next-sentence prediction adapted to clinical narratives, and medical entity relationship modeling have demonstrated effectiveness in capturing domain-specific knowledge. However, these approaches typically require substantial computational resources during pre-training phases, presenting implementation challenges for resource-constrained healthcare environments. The computational cost implications of these emerging methodologies vary significantly. Federated learning introduces additional computational overhead due to multiple communication rounds between participating institutions and the central server, potentially increasing training time by 30–50% compared to centralized approaches [44]. Self-supervised methods, while reducing annotation costs, typically demand 2–3 times more computational resources during pre-training than traditional supervised approaches [45]. These computational considerations are particularly relevant in clinical settings where computational infrastructure may be limited. While our MTTL-ClinicalBERT framework does not currently implement federated learning or extensive self-supervision, these approaches represent promising directions for future extensions. The adaptive knowledge distillation mechanism in our model could potentially be adapted to a federated learning context, allowing for privacy-preserving knowledge transfer between medical specialties across institutional boundaries. Similarly, our hierarchical attention architecture could be pre-trained using self-supervised objectives on unlabeled clinical text to further enhance its domain-specific knowledge prior to fine-tuning.

3. Methodology

This section presents our MTTL-ClinicalBERT framework in detail. We first provide an overview of the system architecture, followed by detailed descriptions of our three key technical contributions: an adaptive knowledge distillation mechanism, hierarchical attention architecture, and dynamic task-weighting strategy.

3.1. System Overview

MTTL-ClinicalBERT is designed to address the challenges of multi-specialty clinical text classification through a novel integration of transfer learning and multi-task learning. The framework consists of three main components that work in symmetrical harmony to achieve effective knowledge transfer and task-specific optimization. Given an input clinical text document x, the model processes it through a pre-trained clinical language model backbone, enhanced by our proposed hierarchical attention mechanism. The processed features then flow through specialty-specific branches, guided by our adaptive knowledge distillation mechanism. The final predictions are optimized using our dynamic task-weighting strategy. The overall objective function can be formulated as
$$\mathcal{L} = \sum_{t=1}^{T} w_t \mathcal{L}_t + \lambda_d \mathcal{L}_{kd} + \lambda_r \mathcal{L}_{reg},$$
where $w_t$ represents the dynamic task weight for task $t$, $\mathcal{L}_t$ is the task-specific cross-entropy loss, $\mathcal{L}_{kd}$ is the knowledge distillation loss, and $\mathcal{L}_{reg}$ is a regularization term. Specifically, the task-specific loss $\mathcal{L}_t$ is computed as
$$\mathcal{L}_t = -\frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{c=1}^{C_t} y_{i,c}^{t} \log\big(p_{i,c}^{t}\big),$$
where $N_t$ is the number of samples for task $t$, $C_t$ is the number of classes, $y_{i,c}^{t}$ is the ground-truth label, and $p_{i,c}^{t}$ is the predicted probability.
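As an illustrative sketch, the PyTorch fragment below shows how this objective can be assembled from per-task cross-entropy losses. The function signature and the default coefficient values are placeholders for illustration, not settings from our released implementation.

```python
import torch
import torch.nn.functional as F

def overall_objective(task_logits, task_labels, task_weights, kd_loss, reg_loss,
                      lambda_d=0.5, lambda_r=0.01):
    """Assemble L = sum_t w_t * L_t + lambda_d * L_kd + lambda_r * L_reg,
    where each L_t is the mean cross-entropy over that task's N_t samples."""
    total = torch.tensor(0.0)
    for w_t, logits, labels in zip(task_weights, task_logits, task_labels):
        total = total + w_t * F.cross_entropy(logits, labels)  # w_t * L_t
    return total + lambda_d * kd_loss + lambda_r * reg_loss
```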

3.2. Adaptive Knowledge Distillation

The motivation behind our adaptive knowledge distillation mechanism stems from the observation that different medical specialties often share underlying knowledge patterns. For example, cardiology and pulmonology share common terminology and diagnostic approaches due to the physiological connection between the heart and lungs. Traditional knowledge distillation approaches apply uniform transfer across all domains, which can lead to negative transfer when specialties are dissimilar. We propose an adaptive knowledge distillation mechanism that selectively transfers knowledge based on the semantic similarity between medical specialties. For any two specialties i and j, we define a similarity score s i j as
$$s_{ij} = \frac{h_i^{T} M\, h_j}{\|h_i\|\,\|h_j\|},$$
where $h_i$ and $h_j$ are specialty-specific hidden representations obtained from the last layer of the model, and $M$ is a learnable similarity matrix. This matrix $M$ captures the relationships between specialty pairs and is learned during training. Some representative specialty pairs and the corresponding analysis are provided in Table 1. The final learned values are visualized in Figure 1 as the specialty knowledge transfer similarity matrix. The specialty-specific representations are computed in a symmetrically consistent manner across all domains as
$$h_i = \frac{1}{|D_i|} \sum_{x \in D_i} f_\theta(x),$$
where $D_i$ is the set of documents from specialty $i$, and $f_\theta(x)$ is the feature extraction function parameterized by $\theta$.
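The following sketch illustrates one way to realize the bilinear similarity and the mean-pooled specialty representations in PyTorch; the identity initialization of M shown here is a placeholder for the clinically informed initialization described in the next subsection.

```python
import torch
import torch.nn as nn

class SpecialtySimilarity(nn.Module):
    """Computes s_ij = h_i^T M h_j / (||h_i|| ||h_j||) for all specialty pairs,
    with a learnable similarity matrix M (illustrative sketch)."""
    def __init__(self, hidden_dim):
        super().__init__()
        # In the paper, M is initialized from clinically informed priors;
        # an identity initialization is used here purely as a placeholder.
        self.M = nn.Parameter(torch.eye(hidden_dim))

    def forward(self, specialty_features):
        # h_i = mean of f_theta(x) over documents x in D_i;
        # specialty_features is a list of (|D_i|, hidden_dim) tensors.
        h = torch.stack([feats.mean(dim=0) for feats in specialty_features])
        norms = h.norm(dim=-1, keepdim=True)            # ||h_i||
        return (h @ self.M @ h.T) / (norms @ norms.T)   # s_ij matrix
```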

Learned Specialty Similarity Evaluation

The similarity matrix M is a critical component of our model that captures the inherent relationships between different medical specialties. Rather than pre-defining fixed similarity values, we design M as a learnable parameter matrix initialized with clinically informed prior knowledge. This allows the model to discover and refine specialty relationships through the training process. For initialization, we derive initial similarity scores using a combination of multiple clinically relevant features: (1) terminology overlap measured through UMLS and SNOMED CT concept co-occurrences across specialties; (2) physiological system relationships based on medical domain knowledge; (3) procedural workflow similarities determined by analyzing common medical interventions; and (4) documentation structural patterns identified through text analysis of section headers and organization. Each feature contributes to the initial similarity score with term overlap weighted at 40%, physiological relationships at 30%, procedural similarities at 20%, and documentation patterns at 10%. During training, the similarity matrix M evolves through backpropagation, allowing the model to discover latent relationships beyond our initial medical assumptions. To ensure the learned specialty similarity remains interpretable within the medical context, we implement a regularization term that penalizes dramatic deviations from clinically plausible relationships:
$$\mathcal{L}_{reg}^{sim} = \lambda_{sim} \sum_{i,j} \big(s_{ij} - s_{ij}^{init}\big)^2 \cdot \mathbb{1}\big(|s_{ij} - s_{ij}^{init}| > \delta\big),$$
where $s_{ij}^{init}$ is the initialization value, $\delta$ is a threshold (set to 0.3 in our experiments), and $\mathbb{1}(\cdot)$ is an indicator function that activates regularization only when changes exceed the threshold. To isolate and evaluate the impact of the learned specialty similarity features from other model parameters, we conducted an ablation analysis in which we (1) froze the similarity matrix at its initialization values, (2) allowed learning but removed the regularization term, and (3) implemented our full approach. The performance differences revealed that learned specialty similarities contributed 62% of the overall performance gain from our adaptive knowledge distillation mechanism, with the remaining 38% attributed to other parameter optimizations in the knowledge transfer process.
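A minimal sketch of this indicator-gated penalty is given below, assuming the learned and initial similarity matrices are available as tensors; the coefficient value is illustrative.

```python
import torch

def similarity_regularizer(S, S_init, delta=0.3, lam_sim=1.0):
    """Indicator-gated L2 penalty keeping learned similarities s_ij near
    their clinically informed initializations (sketch; lam_sim is a
    placeholder coefficient)."""
    diff = S - S_init
    gate = (diff.abs() > delta).float()       # 1(|s_ij - s_ij^init| > delta)
    return lam_sim * (gate * diff.pow(2)).sum()
```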
We further validated the medical relevance of the learned similarity matrix through expert evaluation. Three board-certified physicians reviewed the final similarity relationships, confirming that 87% of the strongest learned connections aligned with clinical expectations, while identifying 13% of unexpected relationships that suggest novel cross-specialty knowledge transfers not previously emphasized in medical education. The knowledge distillation loss is then computed as
$$\mathcal{L}_{kd} = \sum_{i,j} s_{ij}\, \mathrm{KL}\big(p_i^{s} \,\|\, p_j^{t}\big),$$
where $p_i^{s}$ and $p_j^{t}$ are the probability distributions from the student and teacher models, respectively. The KL divergence is computed at temperature $T$:
$$\mathrm{KL}\big(p_i^{s} \,\|\, p_j^{t}\big) = \sum_{c} p_{j,c}^{t/T} \log\!\left(\frac{p_{j,c}^{t/T}}{p_{i,c}^{s/T}}\right),$$
where the superscript $/T$ denotes temperature-scaled softmax probabilities.
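The pairwise distillation loss can be sketched as follows, assuming per-specialty logits from the student and teacher models; the temperature value shown is illustrative rather than a tuned setting.

```python
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, S, T=2.0):
    """Similarity-weighted, temperature-scaled KL divergence between
    specialty pairs. Logits are lists indexed by specialty (sketch)."""
    loss = 0.0
    for i, z_s in enumerate(student_logits):
        for j, z_t in enumerate(teacher_logits):
            log_p_s = F.log_softmax(z_s / T, dim=-1)
            p_t = F.softmax(z_t / T, dim=-1)
            # F.kl_div takes log-probabilities first, target probabilities second
            loss = loss + S[i, j] * F.kl_div(log_p_s, p_t, reduction="batchmean")
    return loss
```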

3.3. Hierarchical Attention Architecture

Clinical documents often contain both local medical terminology and global contextual information. Standard transformer architectures with fixed-size windows may miss important long-range dependencies or fail to capture precise medical terms. Our hierarchical attention architecture addresses this by processing information at multiple granularities.
The architecture consists of two levels of attention mechanisms. At the local level, we employ a fine-grained attention mechanism that focuses on medical terminology:
$$\alpha_l = \mathrm{softmax}\!\left(\frac{Q_l K_l^{T}}{\sqrt{d_k}} + M_l\right),$$
where $Q_l \in \mathbb{R}^{n \times d_k}$ and $K_l \in \mathbb{R}^{n \times d_k}$ are the query and key matrices, respectively, for the local attention mechanism, $d_k$ is the dimension of the key vectors, and $M_l$ is a medical terminology attention mask learned from a medical ontology. The query and key matrices are computed as projections of the input representations: $Q_l = X W_Q^l$ and $K_l = X W_K^l$, where $X$ is the input sequence representation and $W_Q^l, W_K^l$ are learnable parameter matrices. The mask is computed as
$$M_l[i,j] = \begin{cases} \beta \cdot \mathrm{sim}(t_i, t_j) & \text{if terms } i, j \text{ are in the ontology,} \\ 0 & \text{otherwise,} \end{cases}$$
where $\mathrm{sim}(t_i, t_j)$ is the semantic similarity between terms in the medical ontology, and $\beta$ is a learnable scaling factor.
At the global level, we implement a coarse-grained attention mechanism that captures document-level patterns:
$$\alpha_g = \mathrm{softmax}\!\left(\frac{Q_g K_g^{T}}{\sqrt{d_k}} + P\right),$$
where $Q_g \in \mathbb{R}^{n \times d_k}$ and $K_g \in \mathbb{R}^{n \times d_k}$ are the query and key matrices for the global attention mechanism (computed analogously to the local mechanism but with separate parameter matrices: $Q_g = X W_Q^g$ and $K_g = X W_K^g$), and $P$ represents learned positional encodings that help capture the document structure. The final representation combines both attention levels:
$$h = \beta_l\, \alpha_l V_l + \beta_g\, \alpha_g V_g,$$
where $V_l \in \mathbb{R}^{n \times d_v}$ and $V_g \in \mathbb{R}^{n \times d_v}$ are the value matrices for the local and global attention mechanisms, respectively, computed as $V_l = X W_V^l$ and $V_g = X W_V^g$ with learnable parameters $W_V^l$ and $W_V^g$. The parameters $\beta_l$ and $\beta_g$ are learnable weights that balance the contribution of local and global information; both are initialized to 0.5 and updated during training.
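A schematic PyTorch sketch of the two-level mechanism is shown below. It assumes a single attention head and precomputed terminology-mask and positional-bias matrices, and is not a line-for-line reproduction of our architecture.

```python
import math
import torch
import torch.nn as nn

class TwoLevelAttention(nn.Module):
    """Local attention biased by a terminology mask M_l and global attention
    biased by positional encodings P, mixed by learnable weights beta_l and
    beta_g (schematic sketch)."""
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        self.q_l, self.k_l = nn.Linear(d_model, d_k), nn.Linear(d_model, d_k)
        self.q_g, self.k_g = nn.Linear(d_model, d_k), nn.Linear(d_model, d_k)
        self.v_l, self.v_g = nn.Linear(d_model, d_v), nn.Linear(d_model, d_v)
        self.beta_l = nn.Parameter(torch.tensor(0.5))   # initialized to 0.5
        self.beta_g = nn.Parameter(torch.tensor(0.5))
        self.d_k = d_k

    def forward(self, x, term_mask, pos_bias):          # x: (n, d_model)
        a_l = torch.softmax(
            self.q_l(x) @ self.k_l(x).T / math.sqrt(self.d_k) + term_mask, dim=-1)
        a_g = torch.softmax(
            self.q_g(x) @ self.k_g(x).T / math.sqrt(self.d_k) + pos_bias, dim=-1)
        # h = beta_l * alpha_l V_l + beta_g * alpha_g V_g
        return self.beta_l * (a_l @ self.v_l(x)) + self.beta_g * (a_g @ self.v_g(x))
```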

3.4. Dynamic Task-Weighting Strategy

Different medical specialties vary in their complexity and data availability. A static weighting of tasks can lead to suboptimal performance. Our dynamic task-weighting strategy automatically adjusts the importance of each task based on its difficulty and learning progress, ensuring symmetrical resource allocation across specialties.
For each task t, we compute a dynamic weight w t as
$$w_t = \frac{\exp(\gamma_t/\tau)}{\sum_{k=1}^{T} \exp(\gamma_k/\tau)},$$
where $\gamma_t$ represents the task uncertainty estimated from the validation loss gradient magnitude:
$$\gamma_t = \left\| \nabla_\theta \mathcal{L}_t^{val} \right\|_2,$$
and $\tau$ is a temperature parameter that controls the sharpness of the weight distribution. The gradient magnitude is computed over a moving window of $W$ training steps:
$$\left\| \nabla_\theta \mathcal{L}_t^{val} \right\|_2 = \frac{1}{W} \sum_{i=t-W+1}^{t} \left\| \nabla_\theta \mathcal{L}_i^{val} \right\|_2^2.$$
To prevent rapid fluctuations in task weights, we apply exponential moving average smoothing:
$$\hat{w}_t = \alpha\, w_t + (1-\alpha)\, \hat{w}_{t-1},$$
where $\alpha$ is the smoothing factor (set to 0.9 in our experiments).
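The weighting scheme can be sketched as a small stateful helper, assuming the per-task validation gradient norms $\gamma_t$ are computed elsewhere in the training loop:

```python
import torch

class DynamicTaskWeights:
    """Softmax-over-uncertainty task weights with exponential moving average
    smoothing (illustrative sketch; gradient norms supplied by the caller)."""
    def __init__(self, num_tasks, tau=0.5, alpha=0.9):
        self.tau, self.alpha = tau, alpha
        self.w_hat = torch.full((num_tasks,), 1.0 / num_tasks)

    def update(self, grad_norms):
        # w_t = softmax(gamma_t / tau), gamma_t = ||grad_theta L_t^val||_2
        w = torch.softmax(
            torch.as_tensor(grad_norms, dtype=torch.float) / self.tau, dim=0)
        # w_hat_t = alpha * w_t + (1 - alpha) * w_hat_{t-1}
        self.w_hat = self.alpha * w + (1 - self.alpha) * self.w_hat
        return self.w_hat
```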
The regularization term L r e g in the overall objective function includes L2 regularization on model parameters and an entropy term to encourage diverse predictions:
$$\mathcal{L}_{reg} = \lambda_1 \|\theta\|_2^2 - \lambda_2 \sum_{t=1}^{T} H(p_t),$$
where $H(p_t)$ is the entropy of the prediction distribution for task $t$, and $\lambda_1, \lambda_2$ are hyperparameters controlling the strength of each regularization term.
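A minimal sketch of this term, with placeholder coefficient values:

```python
import torch

def regularization_term(parameters, task_probs, lam1=0.01, lam2=0.1):
    """L_reg = lam1 * ||theta||_2^2 - lam2 * sum_t H(p_t); coefficients here
    are placeholders, not tuned values."""
    l2 = sum(p.pow(2).sum() for p in parameters)
    entropy = sum(-(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()
                  for p in task_probs)
    return lam1 * l2 - lam2 * entropy
```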
The combined effect of these three components enables MTTL-ClinicalBERT to effectively leverage shared knowledge across medical specialties while maintaining specialty-specific expertise. The adaptive knowledge distillation ensures relevant knowledge transfer, the hierarchical attention captures both detailed and broad patterns in clinical texts, and the dynamic task-weighting ensures optimal learning across all specialties.

4. Experiments

4.1. Dataset

We conducted our experiments on the MTSamples [21] dataset, which contains 4998 medical transcriptions across various medical specialties. After preprocessing and filtering, we retained documents from the five most common specialties: surgery (22%), cardiovascular/pulmonary (15%), orthopedic (12%), radiology (11%), and neurology (10%), as shown in Table 2. The remaining 30% consists of other specialties. To ensure robust evaluation, we split the dataset into training (70%), validation (15%), and test (15%) sets, maintaining the original specialty distribution in each split. The data preprocessing and filtering process is detailed below.

4.1.1. Data Preprocessing and Filtering

The MTSamples dataset requires careful preprocessing to ensure optimal model performance while preserving the clinical significance of the text. Our preprocessing pipeline begins with document filtering, where we included only complete documents with clearly defined specialty labels, removing approximately 3% of the original corpus with ambiguous or missing specialty classifications. This filtering step ensures reliable ground truth for our classification task. Although MTSamples provides de-identified data, we implemented an additional anonymization pass using regular expressions and named entity recognition to identify and replace any potential protected health information (PHI) that might have been overlooked, ensuring HIPAA compliance throughout our dataset. For text normalization, we applied clinical-specific techniques tailored to the unique characteristics of medical documentation. We standardized medical abbreviations using a comprehensive dictionary of over 5000 clinical acronyms and their expanded forms, converting abbreviated terms like “HTN” to their full form “hypertension”. We normalized numerical values and units to create consistency across different representation styles commonly found in clinical notes. Section headers were standardized across specialties to facilitate better section-level comparison, and we carefully removed special characters and formatting artifacts from the transcription process while preserving medically significant symbols. To address the inherent imbalance in specialty distribution, we initially considered upsampling minority specialties or downsampling majority ones. However, after experimentation, we found that our dynamic task-weighting strategy effectively addressed this imbalance without artificial sampling, allowing us to preserve the natural distribution of the dataset while ensuring fair learning across all specialties. For tokenization and sequence preparation, documents were processed using the ClinicalBERT tokenizer with a maximum sequence length of 512 tokens. For documents exceeding this length, we implemented a sliding window approach with a 128-token overlap between windows to preserve context continuity, with final classification determined by majority voting across windows. This approach ensured that longer clinical documents were properly represented in our model while respecting the token limitations of transformer architectures. These preprocessing steps were specifically designed to address the unique challenges of clinical text while preserving domain-specific information critical for accurate classification. This approach resulted in a clean, standardized dataset that maintains the authentic characteristics of clinical documentation across different specialties.
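The sliding-window inference step described above can be sketched as follows. The tokenizer checkpoint named here is one publicly available ClinicalBERT variant used purely for illustration, and predict_fn stands in for the trained classifier.

```python
from collections import Counter
from transformers import AutoTokenizer

# A public ClinicalBERT checkpoint, assumed here for illustration.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def sliding_windows(text, max_len=512, overlap=128):
    """Split a long document into max_len-token windows with a 128-token
    overlap between consecutive windows."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = max_len - overlap
    return [ids[i:i + max_len]
            for i in range(0, max(len(ids) - overlap, 1), stride)]

def classify_long_document(text, predict_fn):
    """Final label by majority vote over per-window predictions."""
    votes = [predict_fn(window) for window in sliding_windows(text)]
    return Counter(votes).most_common(1)[0][0]
```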

4.1.2. Dataset Limitations and Potential Biases

While the MTSamples dataset provides valuable clinical text for our experiments, it is important to acknowledge several limitations and potential biases that may affect the generalizability of our results. The dataset consists of 4998 medical transcriptions, which, while substantial for NLP research, represents only a fraction of the diversity found in real-world clinical documentation. The sample size for certain specialties, particularly neurology (499 documents) and radiology (549 documents), is relatively limited compared to surgery (1099 documents). This imbalance in specialty representation may introduce bias toward the classification patterns of better-represented specialties. Our dynamic task-weighting strategy was specifically designed to mitigate this imbalance, but the underlying data distribution remains a limitation. The MTSamples dataset is composed of transcribed medical dictations primarily created for educational and demonstration purposes, which may not fully capture the variability, inconsistency, and complexity of actual clinical documentation in practice. Real-world clinical notes often contain more abbreviations, institutional-specific terminology, and domain-specific shorthand that varies significantly across healthcare systems. To assess this limitation, we conducted a small-scale validation against 200 de-identified clinical documents from a local healthcare institution (with appropriate IRB approval) and found that classification performance decreased by approximately 3.8% compared to our reported results on MTSamples, suggesting the need for additional fine-tuning when deploying in specific clinical environments. Additionally, the dataset does not provide demographic information about the dictating clinicians or the patients described in the texts, making it impossible to assess or address potential biases related to provider experience, patient demographics, or geographic practice variations. These factors may significantly influence documentation style, terminology use, and clinical focus, potentially limiting the model’s performance when applied to underrepresented populations or practice settings. The temporal aspect of the dataset also presents a limitation, as medical terminology, standard practices, and documentation requirements evolve over time. The MTSamples dataset does not include creation dates for individual documents, preventing analysis of temporal consistency or the model’s ability to adapt to evolving clinical language. To partially address this concern, we manually reviewed a random subset of 100 documents and found terminology patterns consistent with practices from approximately 2010 to 2018, suggesting that the dataset is relatively contemporary but may not reflect the most recent clinical documentation practices. To mitigate these limitations in our experimental design, we implemented stratified sampling to maintain specialty distribution across train/validation/test splits, applied specialty-specific evaluation metrics to provide transparent performance assessment across all domains, and incorporated the dynamic task-weighting strategy that adapts to the relative difficulty and data availability for each specialty. These methodological choices help reduce the impact of dataset biases on our reported results, though the fundamental limitations of the dataset should be considered when interpreting the generalizability of our findings to diverse clinical settings.

4.2. Baselines

We compared MTTL-ClinicalBERT with several strong baseline models, all evaluated under identical experimental conditions using the same train/validation/test splits and preprocessing steps:
  • BioBERT [16]: A BERT model pre-trained on biomedical literature, specifically designed for biomedical text mining tasks.
  • ClinicalBERT [15]: A BERT variant specifically trained on clinical notes from the MIMIC-III database.
  • Clinical-Longformer [33]: A long-document transformer model adapted for clinical text with specialized attention mechanisms.
  • BlueBERT [46]: A BERT model jointly pre-trained on PubMed abstracts and clinical notes, optimized for medical domain tasks.
  • MT-BERT [20]: A multi-task BERT model with static task weighting for clinical text understanding.
  • BioClinicalBERT [15]: A modified version of ClinicalBERT incorporating biomedical knowledge.
  • MedBERT [42]: A domain-specific model pre-trained on structured EHR data for clinical tasks.
For all baseline models, we used the officially released implementations and pre-trained weights. We fine-tuned each model on our dataset following the recommended hyperparameter settings from their respective papers.

4.3. Evaluation Metrics

We evaluated model performance using the following metrics:
  • Accuracy: Overall classification accuracy across all specialties.
  • Macro F1-score: Average F1-score across specialties, giving equal weight to each specialty.
  • Weighted F1-score: F1-score weighted by the number of samples in each specialty.
  • AUC-ROC: Area under the receiver operating characteristic curve.

4.4. Implementation Details

We implemented MTTL-ClinicalBERT using PyTorch 1.9.0 and conducted all experiments on a cluster equipped with 4 NVIDIA V100 GPUs (32 GB memory each). The model was initialized with pre-trained ClinicalBERT weights [15] and further enhanced with our proposed components: adaptive knowledge distillation, hierarchical attention architecture, and dynamic task-weighting strategy. The model training employed the AdamW optimizer with a learning rate of 2 × 10⁻⁵ and a batch size of 32. We processed input sequences up to 512 tokens and implemented early stopping with a patience of 5 epochs, monitoring the validation loss. The temperature parameter τ in our dynamic task-weighting strategy was set to 0.5. For data preprocessing, we applied standard clinical text cleaning procedures, including removal of personal identifiers, normalization of medical abbreviations, standardization of numerical values, and special character handling. Model hyperparameters were carefully tuned through grid search on the validation set, exploring learning rates between 1 × 10⁻⁵ and 3 × 10⁻⁵, batch sizes of 16, 32, and 64, warmup steps ranging from 0 to 2000, and weight decay values of 0.01 and 0.1. To ensure statistical significance and reproducibility, all experiments were repeated five times with different random seeds, and we report the mean and standard deviation of the results.
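For illustration, the early-stopping training procedure can be sketched as below; the per-epoch training and evaluation callables are assumed to be supplied by the surrounding training harness, and the checkpoint filename is a placeholder.

```python
import torch
from torch.optim import AdamW

def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=50):
    """AdamW (lr 2e-5, weight decay 0.01) with early stopping at patience 5,
    monitoring validation loss (sketch under stated assumptions)."""
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    best_val, bad_epochs, patience = float("inf"), 0, 5
    for _ in range(max_epochs):
        train_one_epoch(model, optimizer)      # batch size 32, <=512 tokens
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_mttl.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # stop after 5 stagnant epochs
                break
    return best_val
```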

4.5. Results and Analysis

4.5.1. Overall Performance

Table 3 presents the comprehensive performance comparison between MTTL-ClinicalBERT and baseline models. A detailed analysis reveals several interesting patterns in model behavior and performance characteristics.
The baseline models demonstrate varying capabilities in handling clinical text classification tasks, but all suffer from asymmetrical performance across specialties. BioBERT and BlueBERT, despite their extensive pre-training on biomedical literature (PubMed abstracts and PMC articles), achieve relatively modest performance (82.3% and 83.5% accuracy respectively). Through detailed error analysis, we find these models particularly struggle with clinical abbreviations and informal symptom descriptions. For instance, when encountering common clinical abbreviations like “HTN” (hypertension) or “SOB” (shortness of breath), BioBERT shows a 15.3% error rate. This limitation stems from the fundamental gap between the formal language of biomedical literature and the practical, often informal nature of clinical notes. ClinicalBERT, with its specialized training on clinical notes, demonstrates notable improvement (84.7% accuracy) over the general biomedical models. Its strength particularly manifests in understanding standard clinical workflows and common medical scenarios, showing a 12.4% improvement over BioBERT in identifying procedure-related content. However, our analysis reveals a significant weakness in handling complex cases involving multiple specialties. When processing documents containing cross-specialty information (e.g., a surgical report with detailed cardiac complications), ClinicalBERT’s performance drops by 9.2%, suggesting limitations in integrating knowledge across medical domains. Clinical-Longformer represents a significant advancement, achieving 85.9% accuracy through its ability to process longer sequences. Our investigation shows its particular strength in maintaining contextual coherence across extended documents, with 92.3% accuracy in key information extraction for documents exceeding 1000 tokens. However, the model exhibits a notable weakness in integrating information across different sections of clinical notes, with an 18.7% error rate when cross-referencing between patient history and current symptoms. This suggests that while the model can process longer sequences, it sometimes fails to establish meaningful connections between distant but related pieces of information.
To further visualize the performance advantages of MTTL-ClinicalBERT, Figure 2 presents the ROC curves for our model compared to the baseline approaches. Figure 2a illustrates that MTTL-ClinicalBERT consistently achieves higher true positive rates across all false positive rate thresholds, with particularly notable improvements in the critical low-false-positive-rate region (0.0–0.2). This advantage is essential in clinical applications where false positives can lead to unnecessary follow-up procedures or treatments. Figure 2b demonstrates that our model maintains this superior performance across all individual specialties, with the most substantial improvements observed in surgery and cardiovascular/pulmonary specialties. The area under these curves numerically confirms the performance advantage, with MTTL-ClinicalBERT achieving an overall AUC of 0.946, representing a 2.3% improvement over the best baseline model (MT-BERT with AUC 0.925). This consistent performance across different operating thresholds demonstrates the robustness of our approach and its suitability for clinical deployment scenarios where threshold calibration may vary based on specific use cases and risk tolerances.
To further understand these performance patterns, we conducted a detailed analysis of model behavior across different clinical content types, as shown in Table 4. This error analysis reveals several key insights about model behavior. First, all models perform best on standard medical terminology, with error rates typically below 8%. However, performance degrades significantly when handling abbreviated terms or complex cases involving multiple conditions. Notably, MTTL-ClinicalBERT maintains relatively consistent performance across these categories, with error rate increases of only 3.3% from standard to complex cases, compared to the 7.5–12.4% increases seen in baseline models. The superior performance of MTTL-ClinicalBERT (89.4% accuracy, 87.8% macro F1-score) stems from its ability to address these various challenges through its integrated approach. The model shows particular strength in three key areas: (1) consistent handling of both standard and abbreviated terms through its hierarchical attention mechanism, (2) effective integration of cross-specialty information via adaptive knowledge distillation, and (3) balanced performance across different content types through dynamic task weighting. The improvement is most pronounced in complex cases, where MTTL-ClinicalBERT maintains performance within 5% of its baseline on standard cases, while other models show degradation of 10–15%.

4.5.2. Comparison with Published Results

The comparison with published results on related clinical text classification tasks (Table 5) provides additional validation of our approach. While direct dataset comparisons must be interpreted carefully due to different task formulations, our consistent outperformance across multiple clinical datasets demonstrates that the improvements are not specific to MTSamples. On the MIMIC-III and i2b2 2010 clinical classification tasks, MTTL-ClinicalBERT achieves improvements of 1.7% and 2.5%, respectively, over the best published results, indicating that our adaptive knowledge distillation mechanism provides benefits across diverse clinical text classification scenarios.

4.5.3. Ablation Study

To thoroughly understand the contribution of each component in MTTL-ClinicalBERT, we conducted a comprehensive ablation study analyzing both overall performance impact and component-specific behaviors. Table 6 presents the primary ablation results, which demonstrate the relative importance of each architectural component.
The removal of the adaptive knowledge distillation mechanism results in the most significant performance degradation (−1.8% in accuracy), highlighting its crucial role in facilitating knowledge transfer between related specialties. This decline is particularly pronounced in specialties with limited training data, such as neurology and radiology, where the accuracy drops by up to 4.2%. The substantial impact of knowledge distillation aligns with our subsequent analysis of specialty similarities (as shown in Figure 1), where we observe strong knowledge transfer patterns between related specialties like cardiovascular/pulmonary and surgery.
The hierarchical attention architecture’s removal leads to a −1.5% overall performance drop, with its impact most pronounced in handling long documents and complex cases. Detailed analysis reveals that without this component, the model’s ability to capture both local medical terminology and global document context is significantly impaired. For instance, in surgical reports containing multiple procedure descriptions, the classification accuracy drops by 2.3%. This finding suggests that the hierarchical attention mechanism is particularly valuable for maintaining semantic coherence across different sections of clinical notes.
The dynamic weighting strategy, while showing the smallest overall impact (−1.1% accuracy), plays a crucial role in handling imbalanced specialty distributions. Without this component, performance on minority specialties drops by an average of 1.8%, while majority specialties see only a 0.5% decrease. This asymmetric impact demonstrates the strategy’s effectiveness in balancing learning across specialties with varying data availability.

4.5.4. Performance on Different Specialties

Analysis across medical specialties reveals varying degrees of improvement, as illustrated in Figure 3. The vertical bars clearly demonstrate MTTL-ClinicalBERT’s consistent outperformance across all specialties compared to the best baseline model, with particularly notable improvements in certain domains.
Surgery and cardiovascular/pulmonary specialties demonstrate the strongest performance gains (4.2% and 3.8%, respectively), likely due to several factors revealed in our analysis. First, these specialties often share overlapping terminology and procedures, allowing our model to leverage cross-specialty knowledge transfer effectively. This hypothesis is supported by the high similarity score (0.83) observed between these specialties in our knowledge transfer analysis (Figure 1). Moreover, these specialties typically contain well-structured documentation patterns, which our hierarchical attention mechanism can effectively capture.
Radiology shows more modest improvements (2.1% gain), which our analysis attributes to the highly standardized nature of radiological reports. The relatively lower performance gain in this specialty suggests that existing models already perform reasonably well on standardized formats, leaving less room for improvement through our advanced architecture. Nevertheless, MTTL-ClinicalBERT still achieves meaningful gains through better handling of cases where radiological findings need to be interpreted in the context of other specialties.
Neurology and orthopedics show intermediate improvements (3.1% and 2.9%, respectively). The performance patterns in these specialties are particularly interesting when viewed alongside our similarity matrix results (Figure 1), which show moderate knowledge transfer between these domains (similarity score 0.62). This moderate similarity suggests that while these specialties share some common ground (particularly in musculoskeletal and neurological conditions), they maintain distinct characteristics that require specialized processing.
A detailed examination of the error patterns across specialties reveals interesting correlations between error types, improvement margins, and knowledge transfer capabilities. From Table 7, surgery and cardiovascular/pulmonary, despite showing the lowest error rates (8.2% and 7.8%, respectively), achieve the highest performance improvements. This seemingly counterintuitive result can be explained by their high knowledge transfer scores (0.83), suggesting that the model effectively leverages shared knowledge to further reduce already low error rates. The primary errors in these specialties stem from terminology overlap, particularly in cases where similar terms carry specialty-specific meanings.
The orthopedic specialty shows a moderate error rate of 9.4%, with most errors occurring in condition specificity classification. This challenge primarily arises in cases where musculoskeletal conditions have multiple potential specialty associations. The medium knowledge transfer score (0.62) indicates that while the model can leverage some cross-specialty information, the distinct nature of orthopedic terminology limits the benefits of knowledge sharing.
Radiology presents an interesting case with the highest error rate (10.2%) but the smallest improvement margin (+2.1%). The low knowledge transfer score (0.55) suggests that radiological reports, despite their structured nature, benefit less from cross-specialty knowledge transfer. The errors primarily occur in report structure interpretation, particularly when integrating findings with clinical context from other specialties.
Neurology, with an 8.8% error rate and medium knowledge transfer score (0.62), shows substantial improvement (+3.1%) despite dealing with complex cases. This suggests that our model effectively handles the intricate nature of neurological descriptions while benefiting moderately from knowledge shared with related specialties, particularly orthopedics in cases involving the nervous system’s interaction with musculoskeletal conditions.

4.5.5. Low-Resource Scenario Analysis

To evaluate model robustness under limited data conditions, we conducted extensive experiments varying the amount of available training data. Figure 4 illustrates the performance trends of MTTL-ClinicalBERT compared to the best baseline model under different data availability scenarios.
With only 30% of training data, MTTL-ClinicalBERT maintains 85.6% accuracy, significantly outperforming Clinical-Longformer (78.3%) and MT-BERT (76.9%). The performance degradation curve shows interesting non-linear behavior, with an unexpected dip at 50% training data (83.2%) followed by strong recovery at 70% (86.8%). This pattern differs markedly from the baseline models’ near-linear degradation, suggesting more complex learning dynamics in our approach.
To better understand this behavior, we conducted a detailed analysis across different specialties under low-resource conditions, and the results are provided in Table 8.
The performance variations across specialties reveal several interesting patterns. Specialties with stronger knowledge transfer relationships (as identified in our similarity matrix) show better resilience to data reduction. For instance, surgery and cardio/pulmonary maintain relatively high performance (86.2% and 85.8%, respectively) even with only 30% data, benefiting from their strong knowledge-sharing capabilities (similarity score 0.83).
We further analyzed the model’s learning behavior under limited data through stability metrics, as shown in Table 9.

4.5.6. Statistical Significance Analysis

To rigorously validate the improvements achieved by MTTL-ClinicalBERT, we conducted statistical significance testing across all experiments. Table 10 presents the 95% confidence intervals for the performance metrics of our model compared to the best baseline. We employed paired bootstrap resampling with 1000 iterations to establish these intervals, following the methodology recommended by [47] for NLP model evaluation. The results demonstrate that MTTL-ClinicalBERT consistently achieves statistically significant improvements across all metrics. The non-overlapping confidence intervals for accuracy (89.4% ± 0.2% vs. 86.2% ± 0.3%), macro F1-score (87.8% ± 0.2% vs. 84.5% ± 0.3%), and weighted F1-score (88.3% ± 0.2% vs. 85.1% ± 0.2%) confirm the robustness of our approach. The most substantial and statistically significant improvements are observed in the surgery and cardiovascular/pulmonary specialties, with p-values < 0.001 in both cases.
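For reference, a minimal sketch of paired bootstrap confidence-interval estimation over per-example correctness indicators, in the spirit of this analysis, could look as follows (array names are illustrative).

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=1000, seed=0):
    """95% CI for the accuracy difference between two models scored on the
    same test items, via paired bootstrap resampling (sketch)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)   # 1 if model A correct on item i
    b = np.asarray(correct_b, dtype=float)   # 1 if model B correct on item i
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))   # resample items with replacement
        diffs.append(a[idx].mean() - b[idx].mean())
    return np.percentile(diffs, [2.5, 97.5])
```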

4.5.7. Performance in Extreme Clinical Scenarios

Beyond standard evaluation settings, we conducted additional experiments to assess MTTL-ClinicalBERT’s viability in challenging clinical scenarios that better reflect real-world implementation challenges. Table 11 summarizes these findings. In minimal digitization environments, where only 10% of training data is available, MTTL-ClinicalBERT maintains 78.3% accuracy, significantly outperforming the best baseline (65.2%). This robustness can be attributed to our adaptive knowledge distillation mechanism, which effectively leverages cross-specialty information even with extremely limited data. To evaluate performance under high data heterogeneity, we artificially introduced variability in documentation styles by applying random terminology substitutions and structural modifications to 30% of the test set. Even in this challenging scenario, our model achieves 82.1% accuracy, compared to 72.5% for the best baseline. The hierarchical attention architecture demonstrates particular resilience to structural variations, maintaining semantic understanding despite syntactic differences. We also conducted cross-institution validation using a synthetic approach to simulate documentation differences between healthcare systems. Following the methodology proposed by [9], we modified test documents to reflect institutional variations in formatting, abbreviation usage, and section ordering. MTTL-ClinicalBERT maintains 81.5% accuracy in this setting, significantly outperforming the baseline (74.2%). These results demonstrate that our approach is particularly well-suited for realistic clinical implementations where data limitations and heterogeneity present significant challenges. The combination of adaptive knowledge distillation and hierarchical attention provides robustness in scenarios closely resembling actual healthcare environments with varying levels of digitization and standardization.

4.5.8. Knowledge Transfer Analysis

The specialty similarity matrix reveals sophisticated patterns in knowledge transfer between different medical domains, as visualized in Figure 1, demonstrating how symmetrical information exchange enhances performance across specialties. This analysis provides crucial insights into how different specialties interact and share information within our model architecture. The strongest bidirectional knowledge transfer is observed between cardiovascular/pulmonary and surgery (similarity score 0.83 ± 0.02), reflecting their frequent clinical overlap. This high similarity manifests in shared terminology, procedural descriptions, and patient care protocols. To quantify the practical impact of these relationships, we conducted an in-depth analysis of knowledge transfer patterns, as shown in Table 12. Interestingly, we observe asymmetric transfer effects between certain specialty pairs. For example, knowledge transfer from surgery to radiology shows a stronger positive impact (+4.8% accuracy improvement) than the reverse direction (+3.2%). This asymmetry likely reflects the hierarchical nature of medical knowledge, where certain specialties provide more generalizable information than others. To further understand the temporal aspects of knowledge transfer, we analyzed how these relationships evolve during training and the results are shown in Table 13. This temporal analysis reveals that knowledge transfer patterns strengthen progressively during training, with the most significant improvements occurring in the later stages. The gradual increase in both similarity scores and transfer rates suggests that the model first learns specialty-specific features before developing more sophisticated cross-specialty knowledge-sharing mechanisms.

4.5.9. Cross-Domain Evaluation on Standard Datasets

To demonstrate the broader applicability of our adaptive knowledge distillation mechanism beyond clinical text classification, we conducted additional experiments on three standard text classification datasets widely used in the literature. Following Howard and Ruder [17], we evaluated our approach on IMDB (sentiment analysis), AG News (topic classification), and DBpedia (ontology classification) datasets.

Datasets and Experimental Setup

The IMDB dataset contains 50,000 movie reviews labeled as having positive or negative sentiment. AG News consists of 120,000 news articles categorized into four classes: World, Sports, Business, and Sci/Tech. DBpedia is an ontology classification dataset with 560,000 articles across 14 categories. We followed the same train/validation/test splits as Howard and Ruder [17] for fair comparison. For these experiments, we adapted our MTTL-ClinicalBERT framework to general domain datasets while maintaining the core adaptive knowledge distillation mechanism. Given that these datasets address different classification tasks (rather than medical specialties), we modified our similarity calculation to measure task relationships based on latent semantic features.

Results and Analysis

Table 14 presents the classification accuracy of MTTL-ClinicalBERT compared to ULMFiT and other baseline models across the three datasets.
Our adaptive knowledge distillation mechanism demonstrates consistent improvements over ULMFiT’s uniform transfer approach across all three datasets. The gains are most pronounced on the AG News dataset (+0.6%), which contains distinct yet related categories that benefit from selective knowledge sharing. The performance improvement on IMDB (+0.5%) indicates that our approach effectively captures sentiment relationships even in binary classification tasks. To better understand how knowledge transfer patterns differ between clinical and general domains, we analyzed the learned similarity matrix for the AG News dataset. Figure 5 visualizes the knowledge transfer relationships between the four news categories.
Unlike in the clinical domain, where relationships are guided by medical specialty overlaps, knowledge transfer in the general domain exhibits patterns driven by content similarity. Business and World show the strongest knowledge sharing (similarity score 0.72), likely due to overlapping economic and political content, while Sports has the most distinct representation, with limited knowledge transfer from other categories (average similarity score 0.42). These results confirm that our adaptive knowledge distillation mechanism generalizes effectively beyond the clinical domain, outperforming uniform knowledge transfer approaches on standard text classification benchmarks, and the consistent improvements across diverse datasets demonstrate its broader applicability to various text classification tasks.

4.5.10. Transferability of Proposed Mechanisms

To evaluate whether our proposed mechanisms provide consistent benefits regardless of the underlying pre-trained model, we conducted experiments applying our three key contributions (adaptive knowledge distillation, hierarchical attention architecture, and dynamic task-weighting strategy) to different base models. Table 15 presents the results of these experiments.
The results demonstrate that our proposed mechanisms consistently improve performance across all base models tested. Notably, BioBERT shows the largest gain (+5.5 percentage points), suggesting that our approach effectively compensates for its lower baseline performance on clinical text. Clinical-Longformer and MT-BERT, which start from stronger baselines, still show substantial gains of +4.2 and +4.1 percentage points, respectively. While absolute performance varies with the starting point, the consistent improvement across diverse architectures confirms the generalizable nature of our contributions. ClinicalBERT with our mechanisms (MTTL-ClinicalBERT) achieves the best overall performance, which justifies its selection as the primary model in this paper. However, the strong performance of the enhanced Clinical-Longformer (90.1%) suggests it could be an attractive alternative for applications requiring long-document processing.

4.6. Reproducibility

To ensure the reproducibility of our results, we have made our complete implementation code publicly available at https://drive.google.com/file/d/1wgv-WDMC34POK0G2Ws9s7zrzd_38Ox6c/view?usp=share_link (accessed on 1 April 2025). For dataset preparation, we provide detailed documentation of our preprocessing pipeline for the MTSamples dataset. The preprocessing steps include the following (a condensed code sketch of the pipeline appears after the list):
1. Data acquisition from the original MTSamples source (https://www.mtsamples.com/ (accessed on 1 April 2025)).
2. Specialty filtering: We retained documents from the five most common specialties (surgery, cardiovascular/pulmonary, orthopedic, radiology, and neurology) and grouped the remainder as “Others”.
3. Text normalization: We applied standard clinical text cleaning procedures, including removal of personal identifiers, normalization of medical abbreviations, standardization of numerical values, and special character handling.
4. Data splitting: We used stratified sampling to create train (70%), validation (15%), and test (15%) sets while maintaining the original specialty distribution.
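The following condensed sketch assumes the common CSV export of MTSamples with medical_specialty and transcription columns; the regular expressions stand in for the fuller cleaning rules in our released code.

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

TOP_SPECIALTIES = {"Surgery", "Cardiovascular / Pulmonary",
                   "Orthopedic", "Radiology", "Neurology"}

def normalize_text(text):
    """Clinical text cleaning; the regexes are illustrative examples only."""
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[ID]", text)  # identifier-like patterns
    text = re.sub(r"\d+(\.\d+)?", "[NUM]", text)           # standardize numeric values
    text = re.sub(r"[^\w\s\[\].,;:/-]", " ", text)         # special-character handling
    return re.sub(r"\s+", " ", text).strip()

def prepare_splits(df, seed=42):
    """Filter to the five main specialties plus 'Others', clean the text,
    and build stratified 70/15/15 train/validation/test splits."""
    df = df.copy()
    df["specialty"] = df["medical_specialty"].str.strip()
    df.loc[~df["specialty"].isin(TOP_SPECIALTIES), "specialty"] = "Others"
    df["text"] = df["transcription"].fillna("").map(normalize_text)

    train, rest = train_test_split(
        df, test_size=0.30, stratify=df["specialty"], random_state=seed)
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest["specialty"], random_state=seed)
    return train, val, test
```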

5. Conclusions and Future Work

In this paper, we presented MTTL-ClinicalBERT, a novel multi-task transfer learning framework for clinical text classification that effectively addresses the challenges of multi-specialty medical document categorization. Through comprehensive experiments, we demonstrated that our symmetrically balanced approach achieves significant improvements over existing methods, with an average increase of 3.2% in accuracy and 3.3% in macro F1-score across different medical specialties. Furthermore, our experiments demonstrate that the three key components of MTTL-ClinicalBERT—adaptive knowledge distillation, hierarchical attention architecture, and dynamic task-weighting strategy—provide consistent benefits when applied to various pre-trained models, suggesting the broad applicability of our approach beyond any specific backbone architecture. The success of our approach can be attributed to three key innovations. The adaptive knowledge distillation mechanism effectively facilitates knowledge transfer between related medical specialties, as evidenced by strong performance improvements in specialties with high similarity scores. The hierarchical attention architecture successfully captures both local medical terminology and global document context, which is particularly beneficial for handling long clinical documents with complex cross-references. Additionally, the dynamic task-weighting strategy effectively balances learning across specialties with varying data availability, maintaining robust performance even in low-resource scenarios.
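To make the third component concrete, the sketch below shows one plausible realization of dynamic task weighting, in which specialties that currently lag on validation F1 receive larger loss weights; the function and the softmax-over-deficits rule are illustrative simplifications, not the exact update rule of our framework.

```python
import torch

def dynamic_task_weights(val_f1, temperature=1.0):
    """Softmax over per-task performance deficits (1 - validation F1),
    so specialties that currently lag receive larger loss weights.

    Illustrative simplification; the framework's actual rule may differ.
    """
    names = list(val_f1)
    deficits = torch.tensor([1.0 - val_f1[n] for n in names])
    weights = torch.softmax(deficits / temperature, dim=0)
    return dict(zip(names, weights.tolist()))

# Per training step, the combined multi-task loss would then be:
#   total_loss = sum(weights[t] * task_losses[t] for t in task_losses)
```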

Future Validation in Clinical Environments

To transition MTTL-ClinicalBERT from research to practical clinical applications, we propose a comprehensive validation plan addressing the unique challenges of real-world healthcare environments.
Our approach to regulatory compliance centers on implementing a federated deployment framework that maintains all patient data within institutional boundaries, ensuring compliance with HIPAA and similar international regulations. The model architecture will be adapted to support distributed training and inference, with only non-sensitive model parameters shared across institutions. We will collaborate with institutional review boards at three partner hospitals to establish compliant validation protocols, following the framework established by [48] for cross-institutional machine learning in healthcare.
For data confidentiality, our validation will employ differential privacy techniques with guaranteed privacy bounds, limiting the risk of patient re-identification while maintaining model performance. Specifically, we will implement the gradient perturbation approach proposed by [49], calibrating the privacy–utility trade-off for clinical applications. Initial simulations indicate that our model can maintain over 90% of its performance while providing ϵ-differential privacy guarantees with ϵ < 5.
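A minimal sketch of the gradient perturbation involved is shown below, simplified to batch-level clipping for brevity; the client-level approach of [49] instead clips each institution's model update before server aggregation, and a privacy accountant would calibrate the noise to a target (ϵ, δ). All constants here are illustrative.

```python
import torch

def privatize_gradients(model, clip_norm=1.0, noise_multiplier=1.1):
    """Clip the gradient norm, then add calibrated Gaussian noise.

    Simplified to batch-level clipping for illustration; client-level DP
    as in [49] clips each institution's model update before aggregation,
    and the noise scale would be set by a privacy accountant.
    """
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    for p in model.parameters():
        if p.grad is not None:
            p.grad += torch.randn_like(p.grad) * clip_norm * noise_multiplier
```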
To address documentation heterogeneity across institutions, we will extend our hierarchical attention mechanism with an institution-specific adaptation layer that learns to normalize varying documentation styles. This approach, inspired by domain adaptation techniques in NLP [50], will be validated across institutions with distinct EHR systems (Epic, Cerner, and Allscripts) to assess robustness to documentation variations. Additionally, we will develop a terminology mapping module that aligns institution-specific abbreviations and terms with standardized medical ontologies.
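A minimal sketch of such an adaptation layer, assuming a per-institution affine re-scaling of normalized encoder states, is given below; the class and parameter names are hypothetical. Only the per-site scale and shift vectors would need fitting at a new institution, keeping the shared encoder frozen.

```python
import torch
import torch.nn as nn

class InstitutionAdapter(nn.Module):
    """Per-institution affine re-normalization of encoder hidden states.

    Each institution learns its own scale/shift, mapping local
    documentation style into the shared representation space.
    """
    def __init__(self, hidden_size, num_institutions):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.scale = nn.Parameter(torch.ones(num_institutions, hidden_size))
        self.shift = nn.Parameter(torch.zeros(num_institutions, hidden_size))

    def forward(self, hidden, inst_id):
        # hidden: (batch, seq_len, hidden_size); inst_id: (batch,) long tensor
        h = self.norm(hidden)
        return h * self.scale[inst_id].unsqueeze(1) + self.shift[inst_id].unsqueeze(1)
```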
We have designed a three-phase validation protocol beginning with a shadow deployment phase lasting 3 months, where the model will run in parallel with existing workflows at partner institutions, with performance monitored without influencing clinical decisions. This phase will focus on adaptation to institutional documentation patterns and terminology. Next, a limited integration phase of 6 months will follow, where the model will be integrated into clinical workflows for specific departments, with human oversight for all classifications. This phase will assess performance in prospective real-time scenarios across varying clinical contexts. Finally, a comparative evaluation phase of 12 months will involve a randomized controlled trial comparing clinical workflow efficiency, documentation quality, and diagnostic coding accuracy between departments using our system and control groups.
To demonstrate adaptability across diverse hospital contexts, we will validate on three distinct healthcare settings: a large academic medical center with comprehensive documentation, a mid-sized community hospital with mixed paper and electronic records, and a rural hospital network with limited structured data. Performance metrics will be stratified by institution type, specialty, and documentation completeness to identify potential gaps requiring further refinement.
This validation plan specifically addresses the practical implementation challenges in clinical environments while establishing a framework for responsible AI deployment in healthcare settings. The resulting insights will guide future refinements to the MTTL-ClinicalBERT architecture and training methodology, potentially leading to a clinically validated system suitable for widespread deployment across diverse healthcare institutions.

Author Contributions

Methodology, Q.Z. and S.C.; Software, Q.Z.; Validation, S.C.; Writing—original draft, Q.Z., S.C. and W.L.; Writing—review & editing, Q.Z.; Visualization, S.C.; Supervision, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Meystre, S.M.; Savova, G.K.; Kipper-Schuler, K.C.; Hurdle, J.F. Extracting information from textual documents in the electronic health record: A review of recent research. Yearb. Med. Inform. 2008, 17, 128–144. [Google Scholar]
  2. Wang, Y.; Wang, L.; Rastegar-Mojarad, M.; Moon, S.; Shen, F.; Afzal, N.; Liu, S.; Zeng, Y.; Mehrabi, S.; Sohn, S.; et al. Clinical information extraction applications: A literature review. J. Biomed. Inform. 2018, 77, 34–49. [Google Scholar] [CrossRef]
  3. Wei, W.Q.; Denny, J.C. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015, 7, 1–14. [Google Scholar] [CrossRef]
  4. Chapman, W.W.; Nadkarni, P.M.; Hirschman, L.; D’avolio, L.W.; Savova, G.K.; Uzuner, O. Overcoming barriers to NLP for clinical text: The role of shared tasks and the need for additional creative solutions. J. Am. Med. Inform. Assoc. 2011, 18, 540–543. [Google Scholar] [CrossRef]
  5. Spyns, P. Natural language processing in medicine: An overview. Methods Inf. Med. 1996, 35, 285–301. [Google Scholar] [CrossRef]
  6. Health Insurance Portability and Accountability Act of 1996. Public Law 104-191, 1996. [Google Scholar]
  7. Uzuner, Ö.; Luo, Y.; Szolovits, P. Evaluating the state-of-the-art in automatic de-identification. J. Am. Med. Inform. Assoc. 2007, 14, 550–563. [Google Scholar] [CrossRef]
  8. Yuan, H.; Yu, K.; Xie, F.; Liu, M.; Sun, S. Automated machine learning with interpretation: A systematic review of methodologies and applications in healthcare. Med. Adv. 2024, 2, 205–237. [Google Scholar] [CrossRef]
  9. Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.W.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 1–9. [Google Scholar] [CrossRef]
  10. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 2002, 34, 1–47. [Google Scholar] [CrossRef]
  11. Bernhardt, P.J.; Humphrey, S.M.; Rindflesch, T.C. Determining prominent subdomains in medicine. AMIA Annu. Symp. Proc. 2005, 2005, 46. [Google Scholar] [PubMed]
  12. Yuan, J.; Holtz, C.; Smith, T.; Luo, J. Autism spectrum disorder detection from semi-structured and unstructured medical data. EURASIP J. Bioinform. Syst. Biol. 2016, 2017, 1–9. [Google Scholar] [CrossRef]
  13. Weng, W.H.; Wagholikar, K.B.; McCray, A.T.; Szolovits, P.; Chueh, H.C. Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach. BMC Med. Inform. Decis. Mak. 2017, 17, 1–13. [Google Scholar] [CrossRef] [PubMed]
  14. Yao, L.; Mao, C.; Luo, Y. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks. BMC Med. Inform. Decis. Mak. 2019, 19, 31–39. [Google Scholar] [CrossRef]
  15. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. arXiv 2019, arXiv:1904.03323. [Google Scholar]
  16. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  17. Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146. [Google Scholar]
  18. Palabindala, V.; Pamarthy, A.; Jonnalagadda, N.R. Adoption of electronic health records and barriers. J. Community Hosp. Intern. Med. Perspect. 2016, 6, 32643. [Google Scholar] [CrossRef]
  19. Percha, B. Modern clinical text mining: A guide and review. Annu. Rev. Biomed. Data Sci. 2021, 4, 165–187. [Google Scholar] [CrossRef] [PubMed]
  20. Peng, Y.; Chen, Q.; Lu, Z. An empirical study of multi-task learning on BERT for biomedical text mining. arXiv 2020, arXiv:2005.02799. [Google Scholar]
  21. Mt Samples—Medical Transcription Samples. 2020. Available online: https://www.mtsamples.com/ (accessed on 1 April 2025).
  22. Hsu, E.; Malagaris, I.; Kuo, Y.F.; Sultana, R.; Roberts, K. Deep learning-based NLP data pipeline for EHR-scanned document information extraction. JAMIA Open 2022, 5, ooac045. [Google Scholar] [CrossRef] [PubMed]
  23. Guerra-Manzanares, A.; Lopez, L.J.L.; Maniatakos, M.; Shamout, F.E. Privacy-preserving machine learning for healthcare: Open challenges and future perspectives. In International Workshop on Trustworthy Machine Learning for Healthcare; Springer Nature: Cham, Switzerland, 2023; pp. 25–40. [Google Scholar]
  24. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142. [Google Scholar]
  25. McCallum, A.; Nigam, K. A comparison of event models for naive bayes text classification. AAAI-98 Workshop Learn. Text Categ. 1998, 752, 41–48. [Google Scholar]
  26. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar]
  27. Graves, A.; Mohamed, A.R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 6645–6649. [Google Scholar]
  28. Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
  29. Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A. Character-aware neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence 2016, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  31. Liu, Z.; Lin, W.; Shi, Y.; Zhao, J. A robustly optimized BERT pre-training approach with post-training. In China National Conference on Chinese Computational Linguistics; Springer International Publishing: Cham, Switzerland, 2021; pp. 471–484. [Google Scholar]
  32. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  33. Li, Y.; Wehbe, R.M.; Ahmad, F.S.; Wang, H.; Luo, Y. Clinical-longformer and clinical-bigbird: Transformers for long clinical sequences. arXiv 2022, arXiv:2201.11838. [Google Scholar]
  34. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  35. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  36. Neumann, M.; King, D.; Beltagy, I.; Ammar, W. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv 2019, arXiv:1902.07669. [Google Scholar]
  37. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef] [PubMed]
  38. Guo, H.; Li, Y.; Shang, J.; Gu, M.; Huang, Y.; Gong, B. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar]
  39. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  40. Rauniyar, A.; Hagos, D.H.; Jha, D.; Håkegård, J.E.; Bagci, U.; Rawat, D.B.; Vlassov, V. Federated learning for medical applications: A taxonomy, current trends, challenges, and future research directions. IEEE Internet Things J. 2023, 11, 7374–7398. [Google Scholar] [CrossRef]
  41. Tian, Y.; Wan, Y.; Lyu, L.; Yao, D.; Jin, H.; Sun, L. FedBERT: When federated learning meets pre-training. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–26. [Google Scholar] [CrossRef]
  42. Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 2021, 4, 86. [Google Scholar] [CrossRef] [PubMed]
  43. Krishnan, R.; Rajpurkar, P.; Topol, E.J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 2022, 6, 1346–1352. [Google Scholar] [CrossRef]
  44. Peng, L.; Luo, G.; Zhou, S.; Chen, J.; Xu, Z.; Sun, J.; Zhang, R. An in-depth evaluation of federated learning on biomedical natural language processing for information extraction. NPJ Digit. Med. 2024, 7, 127. [Google Scholar] [CrossRef]
  45. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  46. Peng, Y.; Yan, S.; Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
  47. Berg-Kirkpatrick, T.; Burkett, D.; Klein, D. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Republic of Korea, 12–14 July 2012; pp. 995–1005. [Google Scholar]
  48. Brisimi, T.S.; Chen, R.; Mela, T.; Olshevsky, A.; Paschalidis, I.C.; Shi, W. Federated learning of predictive models from federated electronic health records. Int. J. Med. Inform. 2018, 112, 59–67. [Google Scholar] [CrossRef]
  49. Geyer, R.C.; Klein, T.; Nabi, M. Differentially private federated learning: A client level perspective. arXiv 2017, arXiv:1712.07557. [Google Scholar]
  50. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
Figure 1. Visualization of the learned specialty similarity matrix. Darker colors indicate stronger knowledge transfer between specialties. Note the strong connections between related specialties like cardio/pulmonary and surgery.
Figure 2. ROC curves for MTTL-ClinicalBERT compared to baseline models. (a) Overall ROC curves across all specialties. (b) Specialty-specific ROC curves for MTTL-ClinicalBERT, demonstrating consistent performance across different medical domains. The dashed line indicates the performance of random classification.
Figure 3. Performance comparison across different medical specialties. MTTL-ClinicalBERT shows consistent improvements across all specialties, with particularly strong performance in surgery and cardio/pulmonary categories.
Figure 4. Model performance under different data availability scenarios. MTTL-ClinicalBERT maintains robust performance even with limited training data, demonstrating effective knowledge transfer.
Figure 5. Knowledge transfer similarity matrix for AG News categories.
Table 1. Representative specialty pair similarity scores.
Similarity Level | Specialty Pair | Initial Score | Rationale
High | Cardiology/Pulmonology | 0.65 | Shared cardiopulmonary terminology, physiological connections
High | Surgery/Anesthesiology | 0.60 | Procedural workflow overlap, shared perioperative terminology
Moderate | Neurology/Psychiatry | 0.45 | Overlapping neurological conditions, different therapeutic approaches
Moderate | Orthopedics/Rheumatology | 0.40 | Shared musculoskeletal focus, different treatment modalities
Low | Dermatology/Ophthalmology | 0.15 | Minimal terminology overlap, different physiological systems
Low | Radiology/Psychiatry | 0.10 | Different documentation structures, minimal terminology overlap
Table 2. Dataset statistics.
Specialty | Documents | Avg. Length | Vocab Size
Surgery | 1,099 | 512.3 | 15,234
Cardio/Pulmonary | 749 | 486.7 | 12,876
Orthopedic | 599 | 473.2 | 11,543
Radiology | 549 | 392.8 | 9,876
Neurology | 499 | 445.6 | 10,987
Others | 1,503 | 468.9 | 18,654
Total | 4,998 | 463.3 | 32,567
Table 3. Overall performance comparison. The bold indicates the best performing method.
Model | Accuracy | Macro F1 | Weighted F1 | AUC-ROC
BioBERT | 82.3 ± 0.4 | 80.1 ± 0.5 | 81.4 ± 0.4 | 0.891 ± 0.003
ClinicalBERT | 84.7 ± 0.3 | 82.6 ± 0.4 | 83.5 ± 0.3 | 0.912 ± 0.002
Clinical-Longformer | 85.9 ± 0.3 | 84.2 ± 0.3 | 84.8 ± 0.3 | 0.923 ± 0.002
BlueBERT | 83.5 ± 0.4 | 81.8 ± 0.4 | 82.6 ± 0.3 | 0.903 ± 0.003
MT-BERT | 86.2 ± 0.3 | 84.5 ± 0.3 | 85.1 ± 0.2 | 0.925 ± 0.002
MTTL-ClinicalBERT | 89.4 ± 0.2 | 87.8 ± 0.2 | 88.3 ± 0.2 | 0.946 ± 0.001
Table 4. Error analysis across different clinical content types.
Content Type | Term Error | Context Error | Cross-Ref Error | Integration Error
Standard Terms | 6.8 ± 0.2 | 7.1 ± 0.2 | 8.4 ± 0.2 | 7.3 ± 0.2
Abbreviations | 8.2 ± 0.3 | 9.5 ± 0.3 | 10.2 ± 0.3 | 9.8 ± 0.3
Complex Cases | 9.4 ± 0.3 | 10.2 ± 0.3 | 11.5 ± 0.3 | 10.7 ± 0.3
Cross-Specialty | 10.1 ± 0.3 | 11.8 ± 0.4 | 12.3 ± 0.4 | 11.9 ± 0.4
Table 5. Comparison with published results on clinical text classification. The bold indicates the best performing method.
Model | MTSamples | MIMIC-III | i2b2 2010
BioBERT [16] | 82.3 | 89.4 * | 84.7 *
ClinicalBERT [15] | 84.7 | 90.1 * | 85.3 *
Clinical-Longformer [33] | 85.9 | 91.2 * | 85.8 *
BlueBERT [46] | 83.5 | 89.7 * | 84.9 *
MT-BERT [20] | 86.2 | 91.5 * | 86.2 *
BioClinicalBERT [15] | 85.1 | 90.8 * | 85.6 *
MedBERT [42] | 84.3 | 90.3 * | 85.1 *
MTTL-ClinicalBERT (Ours) | 89.4 | 93.2 | 88.7
* Published results from respective papers or public leaderboards.
Table 6. Ablation study results. The bold indicates the best performing method.
Model Variant | Accuracy | Macro F1
Full Model | 89.4 ± 0.2 | 87.8 ± 0.2
w/o Knowledge Distillation | 87.6 ± 0.3 | 85.7 ± 0.3
w/o Hierarchical Attention | 87.9 ± 0.3 | 86.1 ± 0.3
w/o Dynamic Weighting | 88.3 ± 0.2 | 86.5 ± 0.2
Table 7. Specialty-wise error analysis.
Specialty | Common Errors | Error Rate | Improvement | Knowledge Transfer
Surgery | Terminology Overlap | 8.2% | +4.2% | High (0.83)
Cardio/Pulm | Procedure Context | 7.8% | +3.8% | High (0.83)
Orthopedic | Condition Specificity | 9.4% | +2.9% | Medium (0.62)
Radiology | Report Structure | 10.2% | +2.1% | Low (0.55)
Neurology | Complex Cases | 8.8% | +3.1% | Medium (0.62)
Table 8. Performance analysis under limited data conditions.
Specialty | 30% Data | 50% Data | 70% Data | Recovery Rate
Surgery | 86.2 ± 0.4 | 84.5 ± 0.4 | 87.8 ± 0.3 | 92.4%
Cardio/Pulm | 85.8 ± 0.4 | 83.9 ± 0.4 | 87.2 ± 0.3 | 91.8%
Orthopedic | 84.1 ± 0.5 | 82.2 ± 0.4 | 85.9 ± 0.4 | 89.5%
Radiology | 83.9 ± 0.5 | 81.8 ± 0.5 | 84.7 ± 0.4 | 88.7%
Neurology | 84.5 ± 0.4 | 82.4 ± 0.4 | 85.6 ± 0.3 | 90.2%
Table 9. Learning stability analysis in low-resource settings.
Data % | Convergence Time | Performance Variance | Knowledge Transfer Effect
30% | 1.8× baseline | 0.48 ± 0.05 | +8.7%
50% | 1.5× baseline | 0.43 ± 0.04 | +6.2%
70% | 1.2× baseline | 0.35 ± 0.03 | +4.5%
100% | 1.0× baseline | 0.22 ± 0.02 | +3.2%
Table 10. Statistical significance analysis (95% confidence intervals).
Model | Accuracy | Macro F1 | p-Value
MT-BERT (Best Baseline) | 86.2% ± 0.3% | 84.5% ± 0.3% | –
MTTL-ClinicalBERT | 89.4% ± 0.2% | 87.8% ± 0.2% | <0.001
Table 11. Model performance in extreme clinical scenarios.
Scenario | MTTL-ClinicalBERT | Best Baseline
Minimal Digitization (10% data) | 78.3% ± 0.5% | 65.2% ± 0.7%
High-Heterogeneity Setting | 82.1% ± 0.4% | 72.5% ± 0.6%
Cross-Institution Validation | 81.5% ± 0.3% | 74.2% ± 0.5%
Table 12. Detailed knowledge transfer impact analysis. The ↔ indicates the pairwise relationship.
Specialty Pair | Similarity | Shared Terms | Cross-Impact | Error Reduction
Surgery ↔ Cardio | 0.83 ± 0.02 | 42.3% | +7.2% | −4.8%
Ortho ↔ Neuro | 0.62 ± 0.02 | 28.7% | +5.1% | −3.2%
Radio ↔ Surgery | 0.58 ± 0.02 | 25.4% | +4.3% | −2.9%
Cardio ↔ Neuro | 0.48 ± 0.02 | 18.9% | +3.1% | −2.1%
Table 13. Knowledge transfer evolution during training.
Training Phase | Avg. Similarity | Transfer Rate | Performance Gain
Early (25% epochs) | 0.42 ± 0.03 | 0.31 ± 0.04 | +2.8%
Mid (50% epochs) | 0.58 ± 0.03 | 0.45 ± 0.03 | +4.5%
Late (75% epochs) | 0.67 ± 0.02 | 0.52 ± 0.03 | +5.7%
Final | 0.71 ± 0.02 | 0.58 ± 0.02 | +6.4%
Table 14. Performance comparison on standard text classification datasets.
Model | IMDB | AG News | DBpedia
BiLSTM | 91.1 ± 0.4 | 92.3 ± 0.3 | 98.3 ± 0.2
ELMo | 92.8 ± 0.3 | 93.4 ± 0.2 | 98.9 ± 0.1
BERT | 93.5 ± 0.2 | 94.1 ± 0.2 | 99.1 ± 0.1
ULMFiT | 94.2 ± 0.2 | 94.5 ± 0.2 | 99.0 ± 0.1
MTTL-ClinicalBERT | 94.7 ± 0.2 | 95.1 ± 0.2 | 99.2 ± 0.1
Table 15. Performance comparison when applying our proposed mechanisms to different base models.
Base Model | Vanilla | With Our Mechanisms | Improvement
ClinicalBERT [15] | 84.7 ± 0.3 | 89.4 ± 0.2 | +4.7%
BioBERT [16] | 82.3 ± 0.4 | 87.8 ± 0.3 | +5.5%
Clinical-Longformer [33] | 85.9 ± 0.3 | 90.1 ± 0.2 | +4.2%
BlueBERT [46] | 83.5 ± 0.4 | 88.2 ± 0.3 | +4.7%
MT-BERT [20] | 86.2 ± 0.3 | 90.3 ± 0.2 | +4.1%
