MG-ECF: Multi-Granularity Entity-Context Fusion for Drug–Drug Interaction Extraction

Chanaa, Hiba; Chakir, Loqman; Nfaoui, El Habib

doi:10.3390/fi18060289


by Hiba Chanaa *, Loqman Chakir and El Habib Nfaoui

L3IA Laboratory, Faculty of Sciences Dhar El Mahraz, Sidi Mohamed Ben Abdellah University (USMBA), Fez 30000, Morocco

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(6), 289; https://doi.org/10.3390/fi18060289

Submission received: 23 April 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 27 May 2026


Abstract

Adverse drug–drug interactions (DDIs) are a leading cause of preventable medication-related harm, and their automatic detection from the biomedical literature is critical for pharmacovigilance and clinical decision support. Most existing systems derive relation representations from a single source, under-using the rich contextual structure of DDI sentences. We propose MG-ECF (Multi-Granularity Entity-Context Fusion), a relation-classification architecture that extracts three complementary views from a shared biomedical encoder—entity-level, inter-entity contextual, and global sentence representations—which are adaptively combined through a temperature-scaled gating mechanism regularized by view-dropout. MG-ECF was evaluated on the DDI-2013 benchmark under the official shared-task protocol, with multi-seed experiments on BioBERT and BiomedBERT backbones and a focal-loss objective to address severe class imbalance. MG-ECF achieves a mean micro-F1 of 90.55% with BiomedBERT and 88.8% with BioBERT, an absolute improvement of 2.7 F1 points over the strongest previously reported PLM-based system (BioMCL-DDI, 87.8%). Systematic component analyses confirm the contribution of each representational view, demonstrating the effectiveness of multi-granularity fusion for DDI classification and its potential as a research-stage building block for Internet-based pharmacovigilance platforms and networked clinical decision support systems, pending real-world clinical validation.

Keywords:

drug–drug interaction; biomedical relation extraction; BioBERT; BiomedBERT; multi-granularity fusion; pharmacovigilance; natural language processing

1. Introduction

Drug–drug interaction (DDI) detection and classification from biomedical text is a critical task in clinical pharmacovigilance [1,2]. Adverse DDIs are among the leading causes of preventable medication-related complications, contributing to hospital admissions, adverse drug events, and clinical decision-making errors [3,4]. As biomedical knowledge continues to expand through Internet-accessible repositories, online literature databases, and networked health platforms, automated DDI extraction has become an essential component of large-scale drug safety surveillance and cloud-based clinical decision support systems. Advances in deep learning and domain-specific pre-trained language models (PLMs) such as BioBERT [5] and BiomedBERT [6] have substantially improved DDI classification, and the DDI-2013 shared-task corpus [1,2] has become the de facto benchmark for comparing these systems. Despite this progress, DDI classification remains challenging.

Two structural difficulties persist. First, the DDI-2013 corpus is severely class-imbalanced: negative (non-interacting) pairs account for more than $80\%$ of instances [7,8], which biases gradient-based optimization toward the majority class. Second, the semantic distinction between interaction types is subtle and context-dependent: distinguishing a pharmacokinetic mechanism from a clinical effect or prescriptive advice often requires capturing evidence that is syntactically distant from the target drug entities [9,10]. Most existing PLM-based approaches address these challenges by fine-tuning on a single representation source, typically the [CLS] token [11] or entity-marker embeddings [12,13], which collapses the sentence into one view and discards the complementary evidence distributed across entity, inter-entity, and global contextual regions.

Prior work has explored enriching PLM representations with entity-aware attention and external drug information [8,9,14] and deep recurrent architectures with attention mechanisms [10], yet none of these lines of work jointly models all three granularity levels (entity-specific, inter-entity contextual, and global sentence-level) within a single adaptive fusion framework. Comparisons across these studies are further complicated by inconsistent experimental protocols: some works merge the official training and development sets while others do not, making it difficult to isolate architectural gains from data-preparation choices [1].

Several recent DDI extraction systems have pursued complementary strategies. Zhu et al. [9] combine entity-aware attention layers over BioBERT and BiGRU outputs, yet do not isolate entity spans and inter-entity context as separately pooled views from a single shared encoder. EMSI-BERT [15] constructs multiple symbol-inserted input variants and extracts a [CLS] representation per variant, fusing at the input level with no adaptive per-instance gate. BioFocal-DDI [16] combines BioGPT data augmentation with a BioBERT–BiLSTM–ReGCN pipeline, but does not extract dedicated entity-span or inter-entity context views as separate streams. BioMCL-DDI [17], the current strongest baseline ($\mu F_{1} = 87.8\%$), applies meta-contrastive learning to a single [CLS] vector without separately pooling entity spans or inter-entity context. To the best of our knowledge, MG-ECF is the first DDI extraction system that simultaneously (i) extracts entity-span, inter-entity context, and global [CLS] representations from a single shared biomedical encoder, (ii) fuses them through a temperature-scaled, entropy-regularized gate whose weights are computed per instance, and (iii) combines this multi-granularity fusion with view-dropout regularization in a single unified training objective. A full system-by-system comparison is provided in Section 2.

We ask: can explicitly extracting and adaptively fusing representations at three complementary granularity levels yield more discriminative DDI classification than any single-view baseline? To answer this question, we propose MG-ECF (Multi-Granularity Entity-Context Fusion), a framework that extracts three parallel feature streams from a shared biomedical transformer encoder and combines them through a temperature-scaled gating mechanism [18,19] regularized by view-dropout [20] and an entropy penalty. We evaluate MG-ECF on the DDI-2013 benchmark under the official train/dev/test protocol [1] using both BioBERT and BiomedBERT, with multi-seed experiments to ensure statistically reliable estimates [21,22].

The remainder of this paper is organized as follows. Section 2 surveys related work; Section 3 presents the proposed methodology; Section 4 reports experimental results and ablation analyses; Section 5 discusses model behavior and limitations; and Section 6 concludes with directions for future work.

The main contributions of this work are as follows:

1. We propose MG-ECF, a multi-granularity fusion architecture that jointly integrates entity-level, inter-entity contextual, and global sentence representations for biomedical relation classification.
2. We introduce an adaptive temperature-scaled gating mechanism with view-dropout regularization, enabling robust, instance-aware feature fusion.
3. We conduct an extensive empirical evaluation on DDI-2013 using BioBERT and BiomedBERT, including multi-seed experiments and ablation studies that quantify the contribution of each representational component.
4. We employ a class-balanced focal loss objective to directly address the severe label imbalance inherent in the DDI-2013 corpus, where the negative class constitutes over $80\%$ of training pairs. By down-weighting easy majority-class examples and focusing gradient updates on hard minority samples, this strategy enables more reliable learning across the clinically significant but under-represented interaction types.

2. Related Work

Research on the automatic extraction of drug–drug interactions (DDIs) from the biomedical literature has evolved along a well-defined methodological trajectory. Since the release of the DDI corpus and the DDIExtraction 2013 shared task [1,2], the field has progressed from hand-crafted rules and linguistic kernels, through feature-engineered machine learning, to deep neural architectures and, more recently, to domain-specific contextualized language models [23,24]. In this section, we survey representative work from each of these four generations, summarize their respective strengths and limitations, and position the proposed MG-ECF with respect to the current state of the art.

2.1. Rule-Based and Linguistic-Kernel Methods

The earliest DDI extraction systems relied on manually designed patterns, syntactic rules, and linguistic kernels. Segura-Bedmar et al. [25] proposed a shallow linguistic kernel combining lexical, morpho-syntactic, and chunk-level features. At DDIExtraction 2013, the UTurku group [26] used an SVM-based machine learning pipeline with features derived from dependency parse graphs and domain knowledge resources, and Thomas et al. [27] aggregated multiple kernel-based classifiers through majority voting. While interpretable and data-efficient, these systems are brittle to syntactic variation, susceptible to cascading NLP tool errors, and unable to capture long-range semantic dependencies.

2.2. Classical Machine Learning Methods

Classical ML approaches framed DDI extraction as supervised pairwise classification over hand-crafted features. Chowdhury and Lavelli [28] combined tree and graph kernels with lexical and entity-based features, ranking among the top DDIExtraction 2013 entries. Kim et al. [29] showed that a linear SVM on rich feature sets could match more expensive kernel methods, while Raihani and Laachfoubi [30] addressed class imbalance through refined feature engineering. The fundamental limitation of this generation is feature dependency: templates are labour-intensive, fail to generalize across syntactic variations, and cannot capture distributed semantic similarity between drugs [23].

2.3. Deep Learning Methods

Deep learning approaches replaced hand-crafted features with learned distributed representations. Liu et al. [31] introduced position-aware CNNs for DDI extraction; Zhao et al. [32] and Quan et al. [33] extended these with syntactic features and multichannel embeddings. On the recurrent side, Sun et al. [34] developed an RHCNN with focal loss; Fatehifar and Karshenas [10] applied a position-and-similarity fusion attention mechanism over BiLSTM-encoded representations; and Zhang et al. [7] proposed a hierarchical RNN over the shortest dependency path, setting the pre-PLM state of the art. Despite eliminating hand-crafted features, these models train on a small corpus (≈27,000 pairs), rely on a single representation source, and struggle to distinguish fine-grained interaction types such as int and effect.

2.4. Contextualized Language-Model-Based Methods

Pre-trained language models (PLMs) have redefined the state of the art for biomedical relation classification. BioBERT [5] and BiomedBERT [6], pre-trained on PubMed and PMC corpora, provide strong biomedical representations. On the DDI task, Zhu et al. [9] encoded sentences with BioBERT and BiGRU and applied entity-aware attention mechanisms incorporating external drug descriptions; Mondal [14] incorporated chemical structure information; Asada et al. [8] combined molecular structure graph representations and drug description text with the sentence encoding; Huang et al. [15] proposed EMSI-BERT, constructing symbol-inserted BERT input variants to represent entity combinations ($\mu F_{1} = 82.0\%$). More recently, Yuan et al. [16] proposed BioFocal-DDI, using a BioBERT–BiLSTM–ReGCN encoder with focal-loss attention and BioGPT-based data augmentation ($\mu F_{1} = 86.6\%$); Jia et al. [17] applied meta-contrastive learning (BioMCL-DDI, $\mu F_{1} = 87.8\%$), setting the strongest prior PLM-based baselines on DDI-2013.

Despite their superiority, two gaps persist. (i) Granularity gap. Most systems derive a single fixed representation, either the [CLS] token [11] or entity-marker embeddings [12,13], thereby discarding complementary evidence distributed across entity, inter-entity, and global regions of the sentence. (ii) Fusion gap. When multiple views are used, they are combined with fixed weights, without instance-adaptive modulation. These two gaps directly motivate the design of MG-ECF.

A fine-grained comparison clarifies what MG-ECF contributes beyond each of these systems. Zhu et al. [9] build three parallel entity-aware attention layers over BioBERT and BiGRU outputs and concatenate their results for final classification; the entity spans and inter-entity context region are not isolated as separately pooled views from a single shared encoder pass, and no per-instance adaptive gate controls the relative contribution of each stream. EMSI-BERT [15] constructs four symbol-inserted input formulations of the same sentence and processes each through BERT independently, producing a single [CLS] output per variant; the fusion operates at the input-formatting level and no adaptive per-instance gate modulates channel contributions. BioFocal-DDI [16] uses BioGPT for training-data augmentation and a BioBERT–BiLSTM–ReGCN pipeline for feature extraction with focal-loss attention; it does not extract dedicated entity-span or inter-entity context views as separately pooled streams. BioMCL-DDI [17] applies meta-contrastive learning to a single [CLS] vector and does not pool entity spans or the inter-entity context as separate representational streams. None of these systems simultaneously: (i) extracts all three views from a single shared encoder in one forward pass; (ii) fuses them through a per-instance temperature-scaled, entropy-regularized gate; and (iii) integrates view-dropout into the unified training objective. MG-ECF is, to the best of our knowledge, the first DDI extraction system to satisfy all three conditions jointly.

3. Materials and Methods

This section presents the proposed MG-ECF (Multi-Granularity Entity-Context Fusion) framework for drug–drug interaction (DDI) classification. MG-ECF is motivated by a specific weakness of prior biomedical relation-extraction approaches: their reliance on a single representation source, such as the [CLS] token [11] or entity-marker embeddings [12,13]. In contrast, MG-ECF explicitly extracts three complementary views from a shared biomedical encoder [5,6] and adaptively fuses them through a temperature-scaled gating mechanism [18,19] regularized by both view-dropout and an entropy penalty on the gate distribution. An overview of the architecture is shown in Figure 1.

3.1. Problem Formulation

Let $D = \{(S^{(i)}, e_{1}^{(i)}, e_{2}^{(i)}, y^{(i)})\}_{i = 1}^{N}$ denote a labeled DDI corpus, where each instance consists of a sentence $S$, an ordered pair of target drug entities $(e_{1}, e_{2})$, and a relation label $y \in Y = \{\text{mechanism}, \text{effect}, \text{advise}, \text{int}, \text{false}\}$.

Following the DDIExtraction 2013 task definition [1,2], the first four labels correspond to the positive interaction categories, while false denotes drug pairs that do not express any interaction. Throughout this paper, negative is used interchangeably with false. Given $(S, e_{1}, e_{2})$, the goal is to learn a classifier $f_{\theta} : (S, e_{1}, e_{2}) \mapsto \hat{y} \in Y$ with parameters $\theta$.

3.2. Input Representation with Entity Markers

Following the matching-the-blanks and R-BERT line of work [12,13], we adopt a marker-based input construction. Four special tokens are added to the tokenizer vocabulary: [DRUG_1], [/DRUG_1], [DRUG_2], and [/DRUG_2]. Let $M$ denote this marker-token set. These markers are inserted so that each target entity is enclosed between its respective opening and closing tokens. Non-target drug mentions in the same sentence are replaced by a generic placeholder token [DRUG], so that the model cannot exploit lexical shortcuts. This blinding protocol is standard practice on DDI-2013 [8,9]. The resulting token sequence of length $n$ is

$$X = (x_{1}, x_{2}, \dots, x_{n}), \tag{1}$$

and the corresponding marker-position index set is

$$I_{M} = \{i \mid x_{i} \in M\}. \tag{2}$$

The sequence is then truncated or padded to a maximum length of $L_{\max} = 128$.
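The marker insertion and blinding step can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' code: it works on whitespace tokens rather than the encoder's subword tokenizer, and the function name `mark_pair` and its span format (half-open, non-overlapping token spans) are hypothetical choices made for the example.

```python
from typing import List, Tuple

MARKERS = ("[DRUG_1]", "[/DRUG_1]", "[DRUG_2]", "[/DRUG_2]")
PLACEHOLDER = "[DRUG]"  # generic token that blinds non-target drug mentions


def mark_pair(tokens: List[str], spans: List[Tuple[int, int]],
              first: int, second: int) -> List[str]:
    """Wrap the two target drug spans in markers and blind all other drugs.

    tokens: whitespace tokens of the sentence (a simplification).
    spans:  half-open (start, end) token spans of every drug mention,
            assumed non-overlapping.
    first, second: positions in `spans` that select the target pair.
    """
    starts = {s: k for k, (s, _) in enumerate(spans)}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i in starts:
            k = starts[i]
            s, e = spans[k]
            if k == first:
                out += ["[DRUG_1]", *tokens[s:e], "[/DRUG_1]"]
            elif k == second:
                out += ["[DRUG_2]", *tokens[s:e], "[/DRUG_2]"]
            else:
                out.append(PLACEHOLDER)  # the drug name itself is discarded
            i = e
        else:
            out.append(tokens[i])
            i += 1
    return out


sentence = "Aspirin may increase the effect of warfarin and heparin".split()
drug_spans = [(0, 1), (6, 7), (8, 9)]
marked = mark_pair(sentence, drug_spans, first=0, second=1)
# Marker-position index set I_M of Equation (2), 0-based here.
marker_positions = [i for i, tok in enumerate(marked) if tok in MARKERS]
```

Each candidate pair in a sentence yields its own marked sequence, so a sentence with three drug mentions produces three inputs that differ only in which mentions are wrapped and which are blinded.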

3.3. Contextual Encoding

The token sequence $X$ is fed into a pre-trained biomedical transformer encoder $\mathrm{Enc}(\cdot)$, producing contextualized token representations

$$H = \mathrm{Enc}(X) = (h_{1}, h_{2}, \dots, h_{n}), \quad h_{i} \in \mathbb{R}^{d}, \tag{3}$$

where $d$ is the encoder hidden dimension ($d = 768$ for the base variants of BioBERT [5] and BiomedBERT/PubMedBERT [6] used in our experiments). The encoder is fine-tuned jointly with the downstream components.

3.4. Multi-Granularity Feature Extraction

MG-ECF extracts three complementary views from $H$ at three different granularity levels.

(i) Entity-level representation ($h_{ent}$).

Let $I_{e_{1}}$ and $I_{e_{2}}$ be the index sets of the tokens spanned by entities $e_{1}$ and $e_{2}$ (excluding the marker tokens). We apply mean pooling [13] over each span,

$$h_{e_{k}} = \frac{1}{|I_{e_{k}}|} \sum_{i \in I_{e_{k}}} h_{i}, \quad k \in \{1, 2\}, \tag{4}$$

and concatenate the two entity vectors:

$$h_{ent} = [h_{e_{1}}; h_{e_{2}}] \in \mathbb{R}^{2d}, \tag{5}$$

where $[\cdot; \cdot]$ denotes vector concatenation. This stream captures drug-specific semantic information tied to the two target entities.

(ii) Inter-entity contextual representation ($h_{ctx}$).

Let $p_{1} = \max(I_{e_{1}})$ and $p_{2} = \min(I_{e_{2}})$ denote the last token of $e_{1}$ and the first token of $e_{2}$, assuming $p_{1} < p_{2}$ (otherwise the two indices are swapped so that the definition is symmetric). The inter-entity context window is defined as

$$I_{ctx} = \{i \mid p_{1} < i < p_{2},\ i \notin I_{M}\}. \tag{6}$$

Marker tokens are positional sentinels that carry no relational linguistic content; their exclusion prevents $h_{ctx}$ from overlapping with the entity-identity signals already captured in $h_{ent}$. If $I_{ctx} = \emptyset$ after marker exclusion, we fall back to a one-token outer lexical neighborhood around the entity pair:

$$I_{ctx} = \left(\{\min(I_{e_{1}} \cup I_{e_{2}}) - 1,\ \max(I_{e_{1}} \cup I_{e_{2}}) + 1\} \cap [1, n]\right) \setminus I_{M}. \tag{7}$$

If this set is still empty in a boundary case, we use $I_{ctx} = \{1\}$ (the [CLS] position) to ensure a non-empty pool. In practice, neither fallback tier activates on any DDI-2013 test instance; the [CLS] guard is retained solely to prevent a zero-denominator in mean pooling. A learned null-context embedding is identified as a future architectural refinement for the rare cases where the inter-entity window and its outer neighborhood are simultaneously empty. The contextual representation is obtained by mean pooling:

$$h_{ctx} = \frac{1}{|I_{ctx}|} \sum_{i \in I_{ctx}} h_{i} \in \mathbb{R}^{d}. \tag{8}$$

This stream captures the relational cues that distinguish interaction types, such as modal verbs for advise, pharmacokinetic terms for mechanism, and outcome descriptors for effect [9].

(iii) Global sentence representation ($h_{cls}$).

We use the first special-token embedding produced by the encoder as the global sentence representation,

$$h_{cls} = h_{\text{[CLS]}} \in \mathbb{R}^{d}, \tag{9}$$

preserving long-range semantic context that may escape the local inter-entity window.
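The three pooling operations of Equations (4) to (9) can be sketched with plain Python lists standing in for encoder outputs. This is a literal reading of the equations under several assumptions: positions are 0-based (so [CLS] is index 0 rather than 1), the helper names are hypothetical, and the handling of a reversed entity order (taking the gap between whichever entity comes first and the other) is one interpretation of the symmetric definition.

```python
from typing import List, Sequence, Set

Vector = List[float]


def mean_pool(H: Sequence[Vector], idx: Sequence[int]) -> Vector:
    """Average the rows of H selected by idx (idx must be non-empty)."""
    d = len(H[0])
    return [sum(H[i][j] for i in idx) / len(idx) for j in range(d)]


def context_indices(e1: List[int], e2: List[int],
                    markers: Set[int], n: int) -> List[int]:
    """Inter-entity window (Eq. 6), outer neighborhood (Eq. 7), then [CLS]."""
    p1, p2 = max(e1), min(e2)
    if p1 > p2:  # assumed handling when e2 precedes e1 in the sentence
        p1, p2 = max(e2), min(e1)
    ctx = [i for i in range(p1 + 1, p2) if i not in markers]
    if not ctx:
        both = e1 + e2
        outer = [min(both) - 1, max(both) + 1]
        ctx = [i for i in outer if 0 <= i < n and i not in markers]
    return ctx or [0]  # final guard: pool the [CLS] position


def extract_views(H: Sequence[Vector], e1: List[int], e2: List[int],
                  markers: Set[int]):
    h_ent = mean_pool(H, e1) + mean_pool(H, e2)  # list concatenation, length 2d
    h_ctx = mean_pool(H, context_indices(e1, e2, markers, len(H)))
    h_cls = list(H[0])
    return h_ent, h_ctx, h_cls


# Toy layout: [CLS] [DRUG_1] a [/DRUG_1] x y [DRUG_2] b [/DRUG_2] [SEP]
H = [[float(i), 1.0] for i in range(10)]  # stand-in for encoder states, d = 2
h_ent, h_ctx, h_cls = extract_views(H, e1=[2], e2=[7], markers={1, 3, 6, 8})
# h_ctx averages positions 4 and 5, giving [4.5, 1.0]
```

Under this literal reading, the positions immediately outside the entity tokens are the enclosing markers themselves, so the second tier comes back empty whenever both entities are wrapped and the sketch falls through to [CLS]; the authors' implementation may define the outer neighborhood differently.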

3.5. Adaptive Multi-View Fusion

3.5.1. Representation Projection

Since $h_{ent} \in \mathbb{R}^{2d}$ while $h_{ctx}, h_{cls} \in \mathbb{R}^{d}$, we first project all three views to a common dimension $d'$ through view-specific affine transformations followed by a tanh non-linearity:

$$\tilde{h}_{ent} = \tanh(W_{ent} h_{ent} + b_{ent}), \tag{10}$$

$$\tilde{h}_{ctx} = \tanh(W_{ctx} h_{ctx} + b_{ctx}), \tag{11}$$

$$\tilde{h}_{cls} = \tanh(W_{cls} h_{cls} + b_{cls}), \tag{12}$$

with $W_{ent} \in \mathbb{R}^{d' \times 2d}$, $W_{ctx}, W_{cls} \in \mathbb{R}^{d' \times d}$, and $\tilde{h}_{v} \in \mathbb{R}^{d'}$ for $v \in \{ent, ctx, cls\}$. In our experiments we set $d' = d = 768$.

3.5.2. Temperature-Scaled Gating

To combine the three aligned views adaptively, we compute gating scores that depend on the views themselves. Let $u = [\tilde{h}_{ent}; \tilde{h}_{ctx}; \tilde{h}_{cls}] \in \mathbb{R}^{3d'}$. A lightweight gating network produces unnormalized logits

$$z = W_{g} u + b_{g} \in \mathbb{R}^{3}, \tag{13}$$

with learnable parameters $W_{g} \in \mathbb{R}^{3 \times 3d'}$ and $b_{g} \in \mathbb{R}^{3}$. The logits are converted into a distribution over views by a softmax with temperature $\tau > 0$:

$$\alpha = \mathrm{softmax}\left(\frac{z}{\tau}\right), \quad \alpha_{v} = \frac{\exp(z_{v} / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}. \tag{14}$$

The temperature $\tau$ controls the sharpness of the distribution: $\tau \to 0^{+}$ drives $\alpha$ towards a one-hot selection, whereas $\tau \to \infty$ drives it towards a uniform average [18]. We treat $\tau$ as a hyperparameter and set $\tau = 4.0$ based on development-set search; a systematic sensitivity analysis over $\tau \in \{1, 2, 4, 8, \infty\}$ is reported in Section 4.6.

The fused representation is the gated sum of the three aligned views:

$$h_{fused} = \sum_{v \in \{ent, ctx, cls\}} \alpha_{v} \tilde{h}_{v} \in \mathbb{R}^{d'}. \tag{15}$$
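A small numerical sketch may help make the effect of the temperature concrete. It implements only Equations (14) and (15) with plain lists; the logits and the two-dimensional toy views are hypothetical values chosen for illustration, and the gating network of Equation (13) is not modeled.

```python
import math
from typing import List, Sequence


def softmax_t(z: Sequence[float], tau: float) -> List[float]:
    """Temperature-scaled softmax of Eq. (14), shifted by the max for stability."""
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(views: Sequence[Sequence[float]], alpha: Sequence[float]) -> List[float]:
    """Gated sum of Eq. (15): sum_v alpha_v * h_v over equal-length views."""
    d = len(views[0])
    return [sum(a * v[j] for a, v in zip(alpha, views)) for j in range(d)]


logits = [2.0, 0.0, -2.0]            # hypothetical gate logits (ent, ctx, cls)
alpha_sharp = softmax_t(logits, 1.0)  # concentrates on the first view
alpha_paper = softmax_t(logits, 4.0)  # tau = 4.0 spreads weight more evenly
alpha_flat = softmax_t(logits, 1e6)   # approaches the uniform average

toy_views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_fused = fuse(toy_views, alpha_paper)
```

With the same logits, raising the temperature from 1 to 4 lowers the largest gate weight, so no single view dominates the fused vector unless its logit is far ahead of the others.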

3.5.3. View-Dropout Regularization

To prevent over-reliance on a single stream and to encourage complementary feature learning across views, we introduce view-dropout, a stream-level analog of standard dropout [20]. During training only, we sample a Bernoulli mask $m_{v} \sim \mathrm{Bern}(1 - p_{view})$ independently for each view $v$ and replace Equation (15) with

$$h_{fused} = \sum_{v} \hat{\alpha}_{v} \tilde{h}_{v}, \quad \hat{\alpha}_{v} = \frac{m_{v} \alpha_{v}}{\sum_{v'} m_{v'} \alpha_{v'} + \epsilon}, \tag{16}$$

where $\epsilon = 10^{-8}$ guarantees numerical stability. If all three masks happen to be zero we resample. This forces the classifier to remain accurate even when any single stream is absent, yielding more robust decisions. We use $p_{view} = 0.3$ throughout; sensitivity to this choice is examined in Section 4.6. At inference time no mask is applied and Equation (15) is used.
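The masking and renormalization of Equation (16) can be sketched as follows. The function name `view_dropout` and the explicit random-generator argument are hypothetical conveniences for a reproducible example; the sketch operates on gate weights only and leaves the view vectors untouched.

```python
import random
from typing import List, Optional, Sequence


def view_dropout(alpha: Sequence[float], p_view: float = 0.3,
                 training: bool = True, eps: float = 1e-8,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Masked, renormalized gate weights of Eq. (16)."""
    if not training:
        return list(alpha)  # inference: no mask, Eq. (15) applies unchanged
    if rng is None:
        rng = random.Random()
    while True:
        # Each view is kept with probability 1 - p_view.
        mask = [1.0 if rng.random() >= p_view else 0.0 for _ in alpha]
        if any(mask):  # resample when every view was dropped
            break
    denom = sum(m * a for m, a in zip(mask, alpha)) + eps
    return [m * a / denom for m, a in zip(mask, alpha)]


gate = [0.5, 0.3, 0.2]
dropped = view_dropout(gate, p_view=0.3, rng=random.Random(0))
kept = view_dropout(gate, training=False)
```

Because the surviving weights are rescaled to sum to approximately one, the magnitude of the fused vector stays comparable whether one, two, or all three views are present in a given training step.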

3.5.4. Entropy Regularization of the Gate

To prevent the gating network from collapsing onto a single view early in training, we add an entropy regularizer on the gate distribution. Because the regularizer is added to a loss that is minimized (Equation (21)), it is defined for a single example as the negative entropy of the gate, scaled by $\lambda$:

$$L_{ent}(\alpha) = \lambda \sum_{v} \alpha_{v} \log \alpha_{v}, \tag{17}$$

with weight $\lambda = 0.3$ (sensitivity analyzed in Section 4.6). Minimizing this term maximizes the gate entropy. Maximizing entropy acts as a soft prior that maintains a well-distributed gate at the beginning of training, and the data-driven gradient of the task loss progressively moves $\alpha$ toward instance-specific fusion patterns as training proceeds.
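The quantity that Equation (17) scales by $\lambda$ is built from the Shannon entropy of the gate, $H(\alpha) = -\sum_{v} \alpha_{v} \log \alpha_{v}$. For three views it ranges from 0 (one-hot gate) to $\log 3 \approx 1.0986$ (uniform gate). The short sketch below, with a hypothetical helper name, evaluates it at both extremes and at an intermediate gate.

```python
import math
from typing import Sequence


def gate_entropy(alpha: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a gate distribution; 0*log(0) is taken as 0."""
    return -sum(a * math.log(a) for a in alpha if a > 0.0)


h_uniform = gate_entropy([1 / 3, 1 / 3, 1 / 3])  # maximum for three views: log(3)
h_one_hot = gate_entropy([1.0, 0.0, 0.0])        # collapsed gate: zero entropy
h_mixed = gate_entropy([0.6, 0.3, 0.1])          # strictly between the two
```

A gate that has collapsed onto one view sits at the zero-entropy end of this range, which is the situation the regularizer is meant to discourage early in training.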

3.6. Classification and Imbalance-Aware Loss

The fused representation is passed through a two-layer classification head with a ReLU activation and dropout,

$$o = W_{c}\, \mathrm{Dropout}(\mathrm{ReLU}(W_{h} h_{fused} + b_{h})) + b_{c} \in \mathbb{R}^{|Y|}, \tag{18}$$

$$p(y \mid S, e_{1}, e_{2}) = \mathrm{softmax}(o). \tag{19}$$

Class-balanced focal loss.

The DDI-2013 corpus is strongly skewed toward the negative class, which accounts for more than $80\%$ of the training pairs [7,8]. To prevent the majority class from dominating the gradient signal, we use the focal loss [35] with per-class balancing weights. For a single example with ground-truth label $y = c$ and predicted probability $p_{c} = p(y = c \mid S, e_{1}, e_{2})$, the task loss is

$$L_{focal}(c) = - w_{c} (1 - p_{c})^{\gamma} \log p_{c}, \tag{20}$$

where $\gamma \geq 0$ is the focusing parameter (we set $\gamma = 2$). The class weights $w_{c}$ are derived from the training-set class frequencies and fixed to $w = [0.501, 0.392, 0.803, 3.277, 0.027]$ for $\{\text{mechanism}, \text{effect}, \text{advise}, \text{int}, \text{false}\}$, which approximately inverts the label prior [36]. Setting $\gamma = 0$ and $w_{c} = 1$ recovers standard cross-entropy as a special case.
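Equation (20) is simple enough to evaluate by hand, and a short sketch shows how strongly the focusing term suppresses confident predictions. The class weights are the values reported above; the probability vectors are hypothetical examples, and the function name is chosen for illustration.

```python
import math
from typing import Sequence

LABELS = ["mechanism", "effect", "advise", "int", "false"]
WEIGHTS = [0.501, 0.392, 0.803, 3.277, 0.027]  # class weights reported in the text


def focal_loss(probs: Sequence[float], target: int,
               weights: Sequence[float] = WEIGHTS, gamma: float = 2.0) -> float:
    """Class-weighted focal loss of Eq. (20) for one example."""
    p_c = probs[target]
    return -weights[target] * (1.0 - p_c) ** gamma * math.log(p_c)


# Hypothetical predicted distributions for a gold "mechanism" example.
easy = focal_loss([0.95, 0.01, 0.01, 0.01, 0.02], target=0)  # confident, correct
hard = focal_loss([0.30, 0.10, 0.10, 0.10, 0.40], target=0)  # uncertain

# gamma = 0 with unit weights reduces to ordinary cross-entropy, -log(p_c).
ce = focal_loss([0.2] * 5, target=1, weights=[1.0] * 5, gamma=0.0)
```

The factor $(1 - p_{c})^{\gamma}$ shrinks the loss of the confident example by several orders of magnitude relative to the uncertain one, which is how easy majority-class pairs stop dominating the gradient.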

Full training objective.

The per-example loss is the sum of the focal task loss and the entropy regularizer of Equation (17):

$$L^{(i)} = L_{focal}(y^{(i)}) + L_{ent}(\alpha^{(i)}), \tag{21}$$

and the full training objective is the mean over a mini-batch $B$:

$$L(\theta, \phi) = \frac{1}{|B|} \sum_{i \in B} L^{(i)}, \tag{22}$$

where $\theta$ and $\phi$ collect the parameters of the downstream components and of the encoder, respectively.

3.7. Training Algorithm

Training iterates over mini-batches: for each candidate pair, the marker-decorated sequence is encoded, three views are extracted and projected, gate weights are computed via Equation (14), view-dropout is applied (Equation (16)), and all parameters are updated jointly with AdamW [37]. At inference time, view-dropout is disabled ($m_{v} \equiv 1$) and the predicted label is $\hat{y} = \arg\max_{c} p(y = c \mid S, e_{1}, e_{2})$.

3.8. Implementation Details

We implemented MG-ECF in PyTorch 2.10.0 (Meta AI, Menlo Park, CA, USA) [38] on top of the Hugging Face Transformers library (Hugging Face, New York, NY, USA) [39]. We used the publicly released checkpoints of BioBERT-base-cased v1.2 [5] and BiomedBERT-base (PubMedBERT uncased, abstracts + full text) [6] as backbone encoders; both expose a hidden size of $d = 768$ and 12 transformer layers. The full model contains 114,223,112 parameters, roughly +4.2 M (3.8% overhead) over a $\sim 110$ M backbone, with negligible inference latency impact. All experiments were run on a single NVIDIA T4 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 16 GB of memory.

The maximum sequence length is $L_{\max} = 128$ tokens, and the common projection dimension is $d' = 768$. The four marker tokens [DRUG_1], [/DRUG_1], [DRUG_2], and [/DRUG_2] and the placeholder [DRUG] were added as new special tokens; their embeddings were randomly initialized and fine-tuned jointly with the rest of the model. Full training hyperparameters and the multi-seed evaluation protocol are detailed in Section 4.2.

Table 1 compares the CLS-only BiomedBERT baseline against full MG-ECF on four practical cost dimensions. The three pooling operations and the lightweight gate add only $+4.2$ M parameters over the backbone ($+3.8\%$), a negligible penalty relative to the $+2.7$ absolute $\mu F_{1}$ gain they deliver. Training time per epoch and peak GPU memory are dominated by the 110 M parameter transformer backbone and are nearly identical across configurations.

The overhead on all four dimensions is below $4\%$, confirming that MG-ECF is practically deployable under the same resource budget as a standard BERT fine-tuning baseline.

4. Results

4.1. Dataset and Evaluation Protocol

We evaluated MG-ECF on the DDI-2013 corpus [1,2], the de facto benchmark for drug–drug interaction extraction. The corpus aggregates drug descriptions from DrugBank and MEDLINE abstracts, annotated at the candidate-pair level with five labels: mechanism, effect, advise, int, and the negative class false. We used the official training, development, and test splits without any additional pre-processing beyond the entity-marker insertion described in Section 3. The resulting sample counts are given in Table 2. The distribution is strongly skewed toward the negative class, which accounts for $86.0\%$ of the training pairs, and the int class is particularly rare (≈0.7% of training pairs and only 96 test pairs).

Table 2 highlights the long-tail label regime that motivates our imbalance-aware objective and the class-specific analyses reported later.

Following the official DDI-2013 protocol, we report micro-averaged precision (P), recall (R), and $F_{1}$ ($\mu F_{1}$) computed over the four positive classes only; the false class is excluded from the aggregate, as is standard [1]. We additionally report the macro-$F_{1}$ to expose the influence of rare classes, and per-class $F_{1}$ for a fine-grained view.
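The positive-class micro-average can be sketched in a few lines. This is a generic illustration of the metric as described above, not the official shared-task scorer: true positives are predictions that match a positive gold label, and the false class contributes only through the errors it causes on positive classes. The label sequences are hypothetical.

```python
from typing import Sequence, Tuple

POSITIVE = ("mechanism", "effect", "advise", "int")


def micro_prf(gold: Sequence[str], pred: Sequence[str],
              positive: Sequence[str] = POSITIVE) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over the positive classes only."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g in positive)
    n_pred = sum(1 for p in pred if p in positive)  # TP + FP
    n_gold = sum(1 for g in gold if g in positive)  # TP + FN
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["effect", "false", "advise", "int", "false", "mechanism"]
pred = ["effect", "effect", "advise", "false", "false", "effect"]
precision, recall, f1 = micro_prf(gold, pred)
# 2 correct positive predictions out of 4 predicted and 4 gold positives
```

Correctly predicted false pairs never enter the counts, so a classifier that labels everything false scores zero under this metric despite high raw accuracy on an 86% negative corpus.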

4.2. Experimental Setup

This subsection describes the experimental design: backbone selection, training configuration, and evaluation protocol. Software and hardware specifics are covered in Section 3.8.

Backbones.

We instantiated MG-ECF on two domain-specific pre-trained language models: BioBERT (dmis-lab/biobert-base-cased-v1.2) [5] and BiomedBERT/PubMedBERT (microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) [6]. Both are base-size models ($d = 768$, 12 layers, 12 attention heads, $\sim 110$ M parameters).

Training.

We trained with AdamW [37,40] using a linear warm-up schedule (warmup ratio 0.1) and a per-step batch size of 16 with $\times 2$ gradient accumulation (effective batch size 32). All hyperparameters were fixed: weight decay 0.01, maximum sequence length 128, learning rate $1 \times 10^{-6}$, up to 12 epochs with early stopping on development $\mu F_{1}$ (patience 5–8), focal loss [35] with $\gamma = 2.0$ and inverse-frequency class weights $w = [0.501, 0.392, 0.803, 3.277, 0.027]$, gate temperature $\tau = 4.0$, view-dropout $p_{view} = 0.3$, and entropy regularizer $\lambda = 0.3$. These values were selected by grid search on the official development set prior to any test-set evaluation; the search grids were $\tau \in \{1, 2, 4, 8\}$, $\gamma \in \{0, 1, 2, 3\}$, $\lambda \in \{0, 0.3, 0.5, 1.0\}$, and $p_{view} \in \{0, 0.1, 0.3, 0.5\}$ (full sensitivity reported in Section 4.6). The three random seeds $\{42, 123, 456\}$ were fixed before any experiment and no seed was selected or filtered based on test-set performance; all MG-ECF results are reported as mean ± standard deviation, following Reimers and Gurevych [21] and Dodge et al. [22].

Model selection.

For every run, the checkpoint with the highest development $\mu F_{1}$ over the four positive classes was used at test time. No hyperparameter was tuned on the test set.

4.3. Overall Performance and Comparison with the State of the Art

Table 3 compares MG-ECF against representative systems spanning three methodological generations. Numbers for prior systems are taken from the corresponding papers; all entries use the official DDI-2013 test split and the four-class $\mu F_{1}$ metric, making the comparison strictly like-for-like. Prior-system results are single-point estimates: none of the baseline publications reports multi-seed variance, making it impossible to include standard deviations for those entries. Statistical significance of the MG-ECF improvement is instead established through the one-sample t-test reported below.

Figure 2 situates our result within the historical trajectory of DDI-2013 systems and makes the generation-level performance trend explicit.

MG-ECF with BiomedBERT achieved a $\mu F_{1}$ of $0.905 \pm 0.0004$, surpassing the strongest published PLM comparator in Table 3 (BioMCL-DDI, $\mu F_{1} = 0.878$) by 2.7 absolute percentage points, and the best non-pretrained neural system [7] by over 15 points (Table 3). The negligible standard deviation ($\pm 0.0004$) across three independent seeds indicates that the improvement is stable and not seed-dependent. To quantify statistical reliability: the $+2.7$ percentage-point improvement over BioMCL-DDI ($\mu F_{1} = 0.878$) is $22.5\times$ larger than the full three-seed variation band ($3\sigma = 3 \times 0.0004 = 0.0012$). A one-sample t-test against the null hypothesis $\mu F_{1} \leq 0.878$ yields $t(2) = 116.9$, $p < 10^{-4}$, confirming that the observed gain cannot be attributed to random initialization. The BioBERT variant reached $\mu F_{1} = 0.888 \pm 0.008$, itself above all prior published results, which shows that the gains are a property of the MG-ECF architecture and not specific to the BiomedBERT pre-training corpus.
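The reported test statistic and the variation-band ratio follow directly from the rounded summary values. The sketch below reproduces that arithmetic, assuming that $\pm 0.0004$ is the sample standard deviation over the $n = 3$ seeds and that the rounded figures quoted in the text are the inputs; the function name is hypothetical.

```python
import math


def one_sample_t(mean: float, std: float, n: int, mu0: float) -> float:
    """t statistic for H0: true mean <= mu0, with n - 1 degrees of freedom."""
    standard_error = std / math.sqrt(n)
    return (mean - mu0) / standard_error


# Rounded values quoted in the text (BiomedBERT backbone, three seeds).
t_stat = one_sample_t(mean=0.905, std=0.0004, n=3, mu0=0.878)

# Gain over BioMCL-DDI relative to the 3-sigma band across seeds.
band_ratio = (0.905 - 0.878) / (3 * 0.0004)
```

With only three seeds the test has two degrees of freedom, so the statistic is driven almost entirely by the very small seed-to-seed standard deviation.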

To separate the contribution of the backbone from the contribution of the fusion module, Table 4 (Section 4.4) includes a CLS-only BiomedBERT baseline (standard single-vector fine-tuning) that achieves $\mu F_{1} = 0.873 \pm 0.002$. This baseline falls just 0.5 points short of the strongest prior PLM-based result (BioMCL-DDI, $\mu F_{1} = 0.878$), confirming that BiomedBERT is itself a competitive backbone; the multi-granularity fusion then adds an additional $+3.2$ points on top to establish the new state of the art.

4.4. Ablation Study

Table 4 presents an ablation study in which architectural components are progressively added to MG-ECF. All five variants are evaluated on both backbones and reported as mean ± std over three seeds $\{42, 123, 456\}$. The five variants are: (i) CLS-only, standard single-vector fine-tuning; (ii) Entity-only, entity-marker pooling without a context bridge or gate; (iii) w/o Context Bridge, entity and global views fused without the inter-entity context stream; (iv) Concat fusion, all three views combined with a fixed linear head (no gate); and (v) Full MG-ECF with the temperature-scaled gate.
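
To make the three views behind these variants concrete, the sketch below pools them from a list of per-token encoder states. It is an illustration only: mean pooling of the inter-entity span follows the description of the context bridge given in the Conclusions, whereas mean pooling of each entity span, concatenation of the two entity vectors, the zero-vector fallback for adjacent drugs, and the names `mean_pool` and `extract_views` are assumptions of ours.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(states: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(states[0])
    return [sum(vec[i] for vec in states) / len(states) for i in range(dim)]


def extract_views(hidden: List[Vector],
                  span1: Tuple[int, int],
                  span2: Tuple[int, int]) -> Tuple[Vector, Vector, Vector]:
    """Return the (entity, context, global) views for one candidate pair.

    hidden: one vector per token, with [CLS] assumed at index 0.
    span1, span2: half-open (start, end) token ranges of the two drugs,
    with span1 located before span2.
    """
    e1 = mean_pool(hidden[span1[0]:span1[1]])
    e2 = mean_pool(hidden[span2[0]:span2[1]])
    entity_view = e1 + e2                       # concatenate both drug vectors
    between = hidden[span1[1]:span2[0]]         # tokens between the two drugs
    # Assumption: fall back to a zero vector when the drugs are adjacent.
    context_view = mean_pool(between) if between else [0.0] * len(hidden[0])
    global_view = hidden[0]                     # [CLS] representation
    return entity_view, context_view, global_view


# Toy example: six tokens with two-dimensional states.
hidden = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 0.0], [0.0, 6.0], [5.0, 5.0]]
entity_view, context_view, global_view = extract_views(hidden, (1, 2), (4, 5))
print(entity_view, context_view, global_view)
```

In this reading, the CLS-only variant uses only `global_view`, the Entity-only variant uses only `entity_view`, and the remaining variants differ in which views are combined and how. Because the context view is a plain mean, a single trigger token contributes with weight 1/n over an n-token gap.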

Figure 3 visualizes the same monotonic improvement pattern reported in Table 4 for both backbones.

All three views contributed non-redundantly on BiomedBERT. The largest single drop came from removing all specialized structures (−3.2 points, CLS-only vs. full), confirming that standard single-vector fine-tuning under-uses the relational structure of DDI sentences. Adding entity markers recovered +0.9 points (Entity-only), consistent with the established benefit of marker-based representations [12,13]. Adding the global sentence representation contributed a further +1.3 points (Entity-only vs. w/o Context Bridge), confirming that long-range sentence context not captured by entity spans alone is necessary for type disambiguation. The inter-entity context bridge then added +0.4 points (w/o Context Bridge vs. Concat fusion), showing that the lexical material between the two drug spans carries relational evidence not captured by entity markers or the [CLS] token alone. Finally, replacing the soft gate with a fixed concatenation head (Concat fusion) lost 0.6 points, demonstrating that dynamic per-instance weighting outperforms a fixed linear combination. The BioBERT boundary rows confirm that the net architectural effect (+3.1 points, CLS-only to full) is not specific to BiomedBERT’s broader pre-training corpus.
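
As a consistency check, the step-wise increments quoted above can be accumulated from the CLS-only BiomedBERT baseline ($\mu F_{1} = 0.873$, reported at the end of the previous subsection). The sketch below is plain arithmetic on the numbers in the text; the intermediate scores it prints are implied by the quoted increments and are not read from Table 4, which remains the authoritative source.

```python
# Increments quoted in the text (BiomedBERT, micro-F1 in points).
cls_only = 87.3
steps = [
    ("Entity-only", 0.9),
    ("w/o Context Bridge", 1.3),
    ("Concat fusion", 0.4),
    ("Full MG-ECF", 0.6),
]

score = cls_only
trajectory = [("CLS-only", score)]
for name, delta in steps:
    score = round(score + delta, 1)   # keep one decimal, as in the text
    trajectory.append((name, score))

total_gain = round(score - cls_only, 1)

for name, value in trajectory:
    print(f"{name:<20s} {value:.1f}")
print(f"total gain: +{total_gain:.1f}")
```

The four increments sum to 3.2 points and end at 90.5, in line with the CLS-only vs. full gap and the full-model score reported for BiomedBERT.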

The individual contribution of each regularization component is formally quantified in the hyperparameter sweep of Section 4.6, which provides a dedicated per-design-choice ablation: disabling entropy regularization ($\lambda = 0$) costs 0.7 points, deactivating view-dropout ($p_{view} = 0$) costs 0.9 points, and using a sharp gate ($\tau = 1.0$ versus the default $\tau = 4.0$) costs 1.2 points. Together with the architectural ablation in Table 4, these results confirm that both the three-view design and each of the three novel regularization choices are independently necessary for the full performance of MG-ECF.

The full MG-ECF configuration started near the CLS-only level at epoch 1 (gate near-uniform at initialization) and progressively overtook all partial variants as the entity and context streams specialized, confirming that the improvement accumulates during training rather than appearing only at the final checkpoint.

4.5. Per-Class Analysis

Table 5 reports per-class performance for both backbone variants (BiomedBERT and BioBERT, seed 42) side by side. For BiomedBERT, the model reached $F_{1} \geq 0.94$ on two of the four positive classes, with particularly strong performance on the safety-critical advise class. The int class remained the dominant error source: the decomposition $P = 0.879$ / $R = 0.531$ indicates that MG-ECF was precise on int but achieved low recall, consistent with the class being severely under-represented in training (177 instances, 0.7% of the training set). This pattern has been reported by every DDI-2013 system across all methodological generations [8,28] and is analyzed further in Section 4.8 and Section 5.

Comparing the BiomedBERT and BioBERT columns of Table 5 reveals that the −1.8 pp micro-$F_{1}$ gap between the two backbones (seed 42: 0.906 vs. 0.888) is distributed across all four classes. The per-class decomposition is: mechanism −2.9 pp, advise −2.0 pp, effect −0.6 pp, and int −3.6 pp. The int class shows the largest absolute gap (0.662 vs. 0.626), with BioBERT’s recall falling to 0.476 compared with BiomedBERT’s 0.531. Both backbones exhibit the same high-precision/low-recall pattern on int (BioBERT: $P = 0.918$, $R = 0.476$; BiomedBERT: $P = 0.879$, $R = 0.531$), confirming that the bottleneck is driven by chronic training scarcity (177 instances, 0.7% of training pairs) rather than backbone pre-training corpus choice.
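
The int-class $F_{1}$ values can be recovered from the quoted precision and recall as their harmonic mean. The sketch below uses only the rounded figures given in the text, so the third decimal may differ by one unit from the values in Table 5; the helper name `f1_score` is ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall on the int class as quoted in the text (seed 42).
int_scores = {
    "BiomedBERT": (0.879, 0.531),
    "BioBERT": (0.918, 0.476),
}

int_f1 = {name: f1_score(p, r) for name, (p, r) in int_scores.items()}
for name, value in int_f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

With these inputs the computation reproduces 0.662 for BiomedBERT and lands within 0.001 of the reported 0.626 for BioBERT, which illustrates how the low recall, rather than the precision, caps the int score on both backbones.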

Figure 4 complements Table 5 by making the class-wise margin over representative baselines immediately comparable.

4.6. Hyperparameter Sensitivity

Table 6 reports the sensitivity of MG-ECF (BiomedBERT, seed 42) to four key hyperparameters: gate temperature $\tau$, focal loss $\gamma$, entropy regularization weight $\lambda$, and view-dropout rate $p_{view}$. Each hyperparameter was varied over four values while all others were held at their defaults.

Three patterns emerged from the sweep. First, a sharper gate ($\tau = 1.0$) hurt performance by 1.2 points relative to the default $\tau = 4.0$: a lower temperature concentrates gate mass prematurely on one view during early training, making the entropy regularizer less effective. Second, focal loss ($\gamma = 2.0$) was the single most impactful individual choice, outperforming standard cross-entropy ($\gamma = 0$) by 2.8 points, confirming the importance of down-weighting the overwhelming negative class on this severely imbalanced corpus. Third, both the entropy coefficient ($\lambda = 0.3$) and view-dropout ($p_{view} = 0.3$) provided consistent gains of approximately 0.7–0.9 points over their deactivated baselines ($\lambda = 0$, $p_{view} = 0$), and degraded gracefully as their values were pushed beyond the optimum. Overall, MG-ECF showed low sensitivity across reasonable ranges of all four hyperparameters, with the default configuration achieving the best result in every sweep. The following section examines the learned gate weights to provide a mechanistic interpretation of how the three views are exploited per class.
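
The effect of the focusing parameter $\gamma$ can be illustrated on two toy predictions. The sketch below implements the standard per-example focal loss, $-(1 - p_t)^{\gamma} \log p_t$, without any class-weighting term; whether the experiments also use class weights is not stated in this section, and the probability vectors are invented for illustration.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Per-example focal loss: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical class order: [false, mechanism, effect, advise, int].
easy_negative = [0.95, 0.02, 0.01, 0.01, 0.01]   # confident, correct negative
hard_positive = [0.60, 0.05, 0.05, 0.05, 0.25]   # true class int receives 0.25

for gamma in (0.0, 2.0):
    easy = focal_loss(easy_negative, target=0, gamma=gamma)
    hard = focal_loss(hard_positive, target=4, gamma=gamma)
    print(f"gamma={gamma}: easy={easy:.5f}, hard={hard:.5f}, "
          f"hard/easy={hard / easy:.0f}")
```

With $\gamma = 0$ the function reduces to cross-entropy. With $\gamma = 2.0$ the loss on the confidently classified negative shrinks by a factor of $(1 - 0.95)^{2}$, so the hard positive example dominates the gradient signal far more strongly, which is the down-weighting behaviour described above.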

4.7. Analysis of the Learned Gating

We examined the weights assigned by the temperature-scaled gate to the three views after training converged. Averaged across all test instances, the gate distributed attention nearly uniformly: entity ($\approx 0.337$), context bridge ($\approx 0.326$), and global ($\approx 0.337$). This near-uniform average is by design: the combination of $\tau = 4.0$, $\lambda = 0.3$, and $p_{view} = 0.3$ actively prevents the gate from collapsing onto a single stream. Crucially, the ablation results in Table 4 show that physically removing any view consistently reduced performance, providing direct evidence that all three streams were genuinely exploited and not redundant.
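
The role of the temperature can be seen in a small numerical sketch. We assume here that the gate is a softmax over three view logits divided by $\tau$ and that the fused vector is the gate-weighted sum of views already projected to a common dimension; the logits are hypothetical, the function names are ours, and view-dropout and the entropy term are omitted.

```python
import math
from typing import List, Sequence


def gate_weights(logits: Sequence[float], tau: float) -> List[float]:
    """Softmax over view logits divided by the temperature tau."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(views: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Gate-weighted sum of equally sized view vectors."""
    dim = len(views[0])
    return [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]


logits = [2.0, 0.5, 1.0]        # hypothetical entity, context, global logits
soft = gate_weights(logits, tau=4.0)
sharp = gate_weights(logits, tau=1.0)

print("tau = 4.0:", [round(w, 3) for w in soft])
print("tau = 1.0:", [round(w, 3) for w in sharp])
```

For the same logits, $\tau = 4.0$ keeps the three weights much closer to uniform than $\tau = 1.0$, which concentrates most of the mass on the largest logit. This is the flattening effect that the near-uniform average weights reflect.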

Per-class gate inspection revealed qualitatively coherent specialization, despite the modest absolute magnitude of the shifts (see Figure 5). Advise predictions, characterized by explicit regulatory phrases such as “should not be used with” or “is contraindicated with”, assigned a relatively higher weight to the entity view, consistent with advisory interactions often depending on the identity of the specific drug pair involved. In contrast, mechanism predictions weighted the context bridge more strongly, consistent with mechanistic interactions being signalled by pharmacokinetic language between the two drug mentions (e.g., “inhibits the CYP3A4 metabolism of”). These patterns were consistent across seeds; given their small absolute magnitude, their primary value is qualitative coherence rather than quantitative effect size.

The gate weight shifts, while modest in absolute size, are mechanistically coherent and consistent with the syntactic structure of DDI sentences.

Inter-entity context dominance for mechanism. Pharmacokinetic trigger phrases such as “inhibits the CYP3A4 metabolism of” or “displaces from plasma protein binding” appear precisely in the inter-entity span; the context bridge aggregates exactly these tokens, explaining why removing it (w/o Context Bridge in Table 4) causes the largest per-class drop on mechanism predictions.

Entity view dominance for advise. Advisory interactions are often determined by the pharmacological class membership of the specific drug pair (e.g., a contraindication holds for all drugs of a particular class); the entity view encodes what the drugs are through their span representations, providing the identity signal that the global [CLS] context alone cannot.

Global view stabilization for effect and int. The broad discourse framing (“this combination may produce…”, “patients receiving both…”) is distributed across the whole sentence and is best captured by the global representation; the gate appropriately assigns it a relatively higher weight for these classes, where predicate-argument structure at the sentence level provides the decisive evidence.

At the default $\tau = 4.0$, the mean gate entropy across all test instances is $H \approx 0.94 \times \log(3) \approx 1.03$ nats, near the maximum entropy for a 3-way distribution ($\log 3 \approx 1.10$ nats), confirming that the gate distributes weight broadly and does not collapse onto a single view. At the sharper $\tau = 1.0$, the mean entropy drops to $H \approx 0.48 \times \log(3) \approx 0.53$ nats: the gate becomes quasi-one-hot, effectively ignoring two of the three views for most instances, which explains the 1.2-point performance drop in Table 6. The entropy regularizer ($\lambda = 0.3$) provides complementary protection during early training: without it ($\lambda = 0$), the gate entropy at epoch 3 is approximately 30% lower than with it, consistent with the task gradient not yet being strong enough to prevent premature view collapse, and with the 0.7-point gap in Table 6. Together, temperature softening and entropy regularization ensure that all three views are exploited throughout training, with the gate becoming progressively more instance-specific as training progresses.

The gate entropy trajectory over training epochs provides further quantitative evidence. At epoch 1 (near-initialization), gate logits are close to zero and both configurations ($\lambda = 0.3$ and $\lambda = 0$) operate near maximum entropy ($H \approx \log(3) \approx 1.10$ nats). By epoch 3, the two configurations diverge: with $\lambda = 0.3$, $H \approx 1.01$ nats (92% of maximum), while without regularization ($\lambda = 0$), $H \approx 0.71$ nats (65% of maximum), indicating incipient view collapse. At convergence (≈epochs 8–10), entropy stabilizes at $H \approx 1.03$ nats with regularization and $H \approx 0.78$ nats without it; the partial collapse in the unregularized model is consistent with the 0.7-point performance deficit reported in Table 6.
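
The entropy figures in this subsection can be retraced with the natural logarithm, for which the maximum of a 3-way distribution is $\log 3 \approx 1.10$; with a base-2 logarithm the maximum would instead be $\log_2 3 \approx 1.58$. The sketch below is a plain recomputation from the numbers quoted in the text. Note that the entropy of the averaged gate weights is not the same quantity as the mean per-instance entropy reported above, which is necessarily lower or equal.

```python
import math
from typing import Sequence


def entropy_nats(weights: Sequence[float]) -> float:
    """Shannon entropy computed with the natural logarithm."""
    return -sum(w * math.log(w) for w in weights if w > 0)


max_entropy = math.log(3)                  # uniform 3-way gate
mean_gate = [0.337, 0.326, 0.337]          # average test-set gate weights

h_mean_gate = entropy_nats(mean_gate)
print(f"max entropy (ln 3)      = {max_entropy:.3f}")
print(f"max entropy (log2 3)    = {math.log2(3):.3f}")
print(f"entropy of mean weights = {h_mean_gate:.3f}")

# Epoch-3 entropies quoted in the text, as a share of the maximum.
for label, h in [("lambda = 0.3", 1.01), ("lambda = 0", 0.71)]:
    print(f"{label}: {h / max_entropy:.0%} of maximum")
```

The shares of 92% and 65% of the maximum follow directly from dividing the quoted epoch-3 entropies by $\log 3$.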

4.8. Error Analysis

Analysis of the 142 misclassified test instances (BiomedBERT, seed 42) revealed three systematic patterns: (i) int/effect confusion, driven by shared surface phrasing and the extreme scarcity of int training data; (ii) long-range context bridge dilution, where drug pairs separated by more than 60 tokens reduce the mean-pooled context signal; and (iii) false-positive predictions (≈21% of errors) from incidental co-administration language that superficially resembles an effect trigger. These patterns motivate span-level attention for long-range cases, targeted data augmentation for int, and negation-aware pre-processing.

Figure 6 reports the recall-normalized confusion matrix for MG-ECF (BiomedBERT, seed 42) on the official test split.

The dominant off-diagonal pattern is the int→effect confusion (30 out of 96 int instances, a 31.3% miss rate): the 177 int training instances are insufficient for the 110 M-parameter encoder to reliably distinguish laconic interaction mentions from the semantically adjacent effect class, which shares surface-level outcome language (“leading to increased serum concentrations”, “may result in altered plasma levels”). The mechanism class shows the highest precision ($P = 0.975$): pharmacokinetic trigger phrases (e.g., “inhibitor of cytochrome P450 3A4”) are syntactically distinctive and captured with high specificity by the inter-entity context view. The advise class achieves the highest recall ($R = 0.977$): modal regulatory verbs (“should not be used with”, “is contraindicated with”) are highly predictive surface cues that the model learns to associate reliably with the advisory label. The 30 false-positive predictions (21% of all errors), that is, instances of the false class assigned to one of the four positive classes (advise: 14, effect: 9, mechanism: 5, int: 2), arise from co-administration sentences that report clinical outcomes without expressing a true pharmacological interaction.
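
The error shares quoted in this subsection follow from simple counts. The sketch below only re-derives the percentages from the figures given in the text (142 misclassified instances, 96 int test instances, and the per-class false-positive breakdown); it does not use any additional data.

```python
# Counts quoted in the error analysis (BiomedBERT, seed 42).
total_errors = 142
int_test_instances = 96
int_to_effect = 30
false_positives = {"advise": 14, "effect": 9, "mechanism": 5, "int": 2}

fp_total = sum(false_positives.values())
int_miss_rate = int_to_effect / int_test_instances
fp_share = fp_total / total_errors

print(f"false positives: {fp_total}")
print(f"int -> effect miss rate: {int_miss_rate:.2%}")
print(f"false-positive share of errors: {fp_share:.1%}")
```

The breakdown sums to 30 false positives, the int→effect rate is 31.25% (reported as 31.3%), and the false-positive share is about 21% of all errors.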

Figure 7 presents two annotated DDI-2013 test instances that illustrate the main success and failure modes. Example 1 (DDI-DrugBank.d580.s1.p0) is correctly classified as mechanism: the inter-entity phrase “inhibitor of cytochrome P450 3A4” is a pharmacokinetic trigger captured with high specificity by the context bridge. Example 2 (DDI-DrugBank.d709.s1.p7) is an int→effect error: the outcome phrase “leading to increased serum concentrations” activates the effect prototype, and the model’s insufficient int training signal (177 instances, 0.7% of training pairs) cannot override this surface cue.

5. Discussion

We found that explicitly extracting and adaptively fusing representations at three complementary granularity levels, namely entity-level, inter-entity contextual, and global sentence-level, yields more discriminative DDI classification than any single-view baseline, establishing a new state of the art on DDI-2013 with both BioBERT and BiomedBERT backbones. Concretely, MG-ECF reaches $\mu F_{1} = 90.55\%$ with BiomedBERT and 88.8% with BioBERT, corresponding to a +2.7 absolute-point gain over the strongest previously reported PLM-based system, BioMCL-DDI (Table 3). The following subsections interpret this result mechanistically, contrast MG-ECF with alternative paradigms, examine the residual error on the rare int class, and assess the practical and validity implications of our conclusions.

5.1. Interpreting MG-ECF’s Gains

Our ablation (Table 4) shows that replacing a [CLS]-only head with the full three-view adaptive fusion improves $\mu F_{1}$ by 3.2 points on BiomedBERT and 3.1 points on BioBERT, consistently across three seeds. The three granularities capture genuinely complementary evidence. The entity view encodes what the drugs are, the inter-entity context view captures how they are related linguistically, and the global [CLS] view encodes the overall discourse framing.

A naive fusion without temperature scaling collapsed rapidly onto the global view in our preliminary experiments. The temperature $\tau = 4.0$, entropy regularizer $\lambda = 0.3$, and view-dropout $p_{view} = 0.3$ jointly prevent this collapse: the monotonic ablation improvement (Table 4) and the class-stratified gate weights (Figure 5) confirm that all three streams are genuinely exploited and not redundant.

5.2. The Int Bottleneck

Despite the state-of-the-art aggregate numbers, the int class remains the dominant residual error source ($F_{1} = 0.662$ on BiomedBERT; Table 5). MG-ECF achieves high precision ($P = 0.879$) but poor recall ($R = 0.531$): the 177 training instances are simply too few for a 110 M-parameter encoder to reliably separate laconic mentions such as “X interacts with Y” from the false class. Closing this gap will likely require targeted data augmentation or distant supervision from DrugBank, rather than architectural changes alone.

5.3. Threats to Validity and Limitations

We close this section by noting the main limitations of our study. (i) Single corpus: all experiments use DDI-2013; generalization to BC7 DrugProt or cross-lingual settings is untested. (ii) Single language: the corpus is English-only. (iii) Hyperparameter search: all hyperparameters were selected on the official development split; sensitivity to the search procedure is not separately measured. (iv) Seed budget: results use three seeds $\{42, 123, 456\}$; the BiomedBERT standard deviation (<0.001) is already low, but a larger budget would tighten confidence intervals further. (v) Prior-work reporting: numbers for prior systems are taken directly from their papers; minor re-implementation differences may account for a small fraction of the +2.7-point gap, though its magnitude and cross-backbone consistency argue against an artefact explanation. None of these limitations undermines the central finding that multi-granularity fusion with a regularized gate is an effective add-on to biomedical PLM encoders for DDI extraction.

A key open question is whether the multi-granularity fusion approach generalizes beyond DDI-2013. We discuss four relevant scenarios.

ChemProt and BC7 DrugProt. ChemProt [41] and the BC7 DrugProt track [42] address chemical–gene and drug–gene relation extraction, respectively. Both use the same sentence-level candidate-pair format as DDI-2013, so the entity marker insertion, inter-entity context window, and [CLS] pooling operations in MG-ECF are directly applicable with no architectural change beyond label-layer replacement. The relative contribution of the inter-entity context view may shift, however, as ChemProt sentences tend to be longer and syntactically more complex; the span-average pooling we use for the context bridge may require replacement by a learned attention mechanism in these settings. Empirical evaluation on these corpora is our most immediate planned extension.

Clinical narratives (MIMIC-III, n2c2). Clinical notes differ substantially from PubMed abstracts in register (telegraphic, with implicit subject entities), vocabulary (brand names, dosing instructions, abbreviations), and noise level. The BioBERT and BiomedBERT backbones are pre-trained on PubMed and PMC text; significant domain shift is expected when applying these encoders to clinical text without further adaptation. Substituting a ClinicalBERT or Bio+ClinicalBERT backbone would be the natural first step before evaluation on clinical corpora, and the multi-granularity fusion layer would remain unchanged.

Cross-lingual settings. DDI-2013 is English-only; extension to the Spanish DDI corpus or multilingual biomedical benchmarks would require a multilingual encoder (mBERT, XLM-R-BioMed) or language-specific pre-training. The architecture itself is language-agnostic beyond the encoder, making such extensions straightforward in principle.

Drug-label and EHR corpora (TAC-DDI, real-world EHR text). Drug-label-derived corpora and de-identified clinical notes represent challenging domain-shift scenarios: regulatory language differs substantially from PubMed prose, and clinical notes are telegraphic and noisy. Without domain adaptation, we expect reduced performance, as the BioBERT and BiomedBERT backbones are pre-trained on PubMed and PMC text only. A transfer experiment on these corpora falls outside the scope of the present study and is identified as our most immediate planned extension alongside ChemProt/BC7 DrugProt.

6. Conclusions

Relation classification in biomedical text is fundamentally a multi-signal problem: determining how two drugs interact requires understanding what they are, how they are linguistically connected, and what the surrounding context implies. However, most existing approaches collapse this rich structure into a single representation, limiting their ability to capture complementary information. MG-ECF addresses this limitation by extracting three complementary representations, entity-level, inter-entity contextual, and global sentence-level, from a shared biomedical encoder, and integrating them through a temperature-scaled gating mechanism regularized by view-dropout and an entropy constraint. Combined with focal loss to mitigate class imbalance, this design provides both a principled and effective solution for DDI classification.

Evaluated on the DDI-2013 benchmark under the official protocol, MG-ECF achieves a micro-$F_{1}$ of 90.55% with BiomedBERT and 88.8% with BioBERT, outperforming the strongest prior PLM-based system (BioMCL-DDI [17], $\mu F_{1} = 87.8\%$) by 2.7 percentage points. Ablation results confirm that all three representations contribute non-redundantly, while adaptive gating consistently improves over static fusion, indicating that multi-granularity fusion is a general modeling principle rather than backbone-specific.

The remaining limitation on the rare int class highlights the impact of data scarcity and suggests that future improvements will require data-centric strategies such as augmentation or distant supervision. Extending MG-ECF to other benchmarks, such as ChemProt and DrugProt, and combining it with knowledge-injection methods represent promising directions.

One architectural limitation deserves note: the inter-entity context bridge uses mean pooling over the token span between entities; for drug pairs separated by more than 60 tokens the pooled signal is diluted, and a learned span-level attention mechanism would be more robust for such cases. A full discussion of study-level limitations is provided in Section 5.3.

Several research directions are identified for extending this work. Multi-task learning: jointly fine-tuning MG-ECF on DDI-2013 and ChemProt/BC7 DrugProt would provide richer supervision and regularize the encoder across relation types, potentially alleviating the int data scarcity problem through transfer from related interaction categories. Graph-augmented encoders: integrating drug knowledge-graph embeddings from DrugBank or ChEMBL into the entity view could supply pharmacological identity signals beyond surface text tokens, particularly for rare or proprietary drug names not well represented in PubMed pre-training corpora. Span-level attention: replacing mean pooling of the inter-entity context with a learned attention mechanism would address the long-range dilution failure mode and improve performance on sentences where the critical relational cue is distant from both entity mentions. Cross-lingual and clinical transfer: evaluating with a clinical or multilingual encoder backbone (ClinicalBERT, XLM-R-BioMed) on MIMIC-III and the Spanish DDI corpus would establish the breadth of the multi-granularity fusion principle across domains and languages.

Overall, this work shows that explicitly modeling and adaptively fusing multi-granularity representations is a key step toward more accurate and robust biomedical relation extraction systems. The interpretable gate weights, single-pass inference overhead, and compact parameter footprint (+3.8%) indicate that MG-ECF is a promising research-stage candidate for practical deployment once validated on broader corpora. Beyond the benchmark setting, the architecture and results suggest potential applicability as a research-stage component within Internet-scale pharmacovigilance workflows, cloud-based drug safety monitoring services, and networked clinical decision support systems, pending real-world clinical evaluation and domain-shift validation. Scalable automated extraction of drug interactions from continuously growing online biomedical repositories is an increasingly critical operational need, and multi-granularity fusion provides an architecturally sound foundation for addressing it. The per-instance gate weights additionally offer a lightweight explainability signal, indicating which representational view (drug identity, relational context, or global discourse) drove each prediction, which may prove useful in clinical-safety auditing contexts as part of a broader human-in-the-loop review pipeline.

Author Contributions

Conceptualization, H.C., L.C. and E.H.N.; methodology, H.C. and E.H.N.; validation, H.C., L.C. and E.H.N.; formal analysis, H.C.; investigation, H.C.; resources, H.C.; data curation, H.C.; writing—original draft preparation, H.C.; writing—review and editing, H.C., L.C. and E.H.N.; supervision, L.C. and E.H.N.; project administration, E.H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DDI-2013 corpus [1,2] is publicly available from the DDIExtraction shared-task organizers. The source code implementing the MG-ECF framework, together with the full training and evaluation scripts and the configuration files required to reproduce all reported results, will be made available in a public GitHub repository upon publication.

Acknowledgments

The authors thank the organizers of the DDIExtraction 2013 shared task for releasing the benchmark corpus, and the developers of BioBERT and BiomedBERT for making their pre-trained models publicly available. We also acknowledge the open-source community for the PyTorch 2.10.0, Hugging Face Transformers, and scientific Python 3.12 ecosystems, which were essential to this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 13–15 June 2013; pp. 341–350.
2. Herrero-Zazo, M.; Segura-Bedmar, I.; Martínez, P.; Declerck, T. The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. J. Biomed. Inform. 2013, 46, 914–920.
3. Businaro, R. Why we need an efficient and careful pharmacovigilance? J. Pharmacovigil. 2013, 1, e110.
4. Percha, B.; Altman, R.B. Informatics confronts drug-drug interactions. Trends Pharmacol. Sci. 2013, 34, 178–184.
5. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
6. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23.
7. Zhang, Y.; Zheng, W.; Lin, H.; Wang, J.; Yang, Z.; Dumontier, M. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics 2018, 34, 828–835.
8. Asada, M.; Miwa, M.; Sasaki, Y. Using drug descriptions and molecular structures for drug-drug interaction extraction from literature. Bioinformatics 2021, 37, 1739–1746.
9. Zhu, Y.; Li, L.; Lu, H.; Zhou, A.; Qin, X. Extracting drug-drug interactions from texts with BioBERT and multiple entity-aware attentions. J. Biomed. Inform. 2020, 106, 103451.
10. Fatehifar, M.; Karshenas, H. Drug-drug interaction extraction using a position and similarity fusion-based attention mechanism. J. Biomed. Inform. 2021, 115, 103707.
11. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
12. Baldini Soares, L.; FitzGerald, N.; Ling, J.; Kwiatkowski, T. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 2895–2905.
13. Wu, S.; He, Y. Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), Beijing, China, 3–7 November 2019; pp. 2361–2364.
14. Mondal, I. BERTChem-DDI: Improved drug-drug interaction prediction from text using chemical structure information. In Proceedings of the Knowledgeable NLP: The First Workshop on Integrating Structured Knowledge and Neural Networks for NLP, Suzhou, China, 7 December 2020; pp. 27–32.
15. Huang, Z.; An, N.; Liu, J.; Ren, F. EMSI-BERT: Asymmetrical Entity-Mask Strategy and Symbol-Insert Structure for Drug–Drug Interaction Extraction Based on BERT. Symmetry 2023, 15, 398.
16. Yuan, Z.; Zhang, S.; Zhang, H.; Xie, P.; Jia, Y. Optimized Drug-Drug Interaction Extraction with BioGPT and Focal Loss-Based Attention. IEEE J. Biomed. Health Inform. 2025, 29, 4560–4570.
17. Jia, Y.; Yuan, Z.; Zhu, L.; Xiang, Z.l. A meta-contrastive learning approach for clinical drug-drug interaction extraction from biomedical literature. PLoS Comput. Biol. 2025, 21, e1013722.
18. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
19. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
20. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
21. Reimers, N.; Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; pp. 338–348.
22. Dodge, J.; Gururangan, S.; Card, D.; Schwartz, R.; Smith, N.A. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2185–2194.
23. Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. Lessons learnt from the DDIExtraction-2013 shared task. J. Biomed. Inform. 2014, 51, 152–164.
24. Zhang, T.; Leng, J.; Liu, Y. Deep learning for drug-drug interaction extraction from the literature: A review. Brief. Bioinform. 2020, 21, 1609–1627.
25. Segura-Bedmar, I.; Martínez, P.; de Pablo-Sánchez, C. Using a shallow linguistic kernel for drug-drug interaction extraction. J. Biomed. Inform. 2011, 44, 789–804.
26. Björne, J.; Kaewphan, S.; Salakoski, T. UTurku: Drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 13–15 June 2013; pp. 651–659.
27. Thomas, P.; Neves, M.; Rocktäschel, T.; Leser, U. WBI-DDI: Drug-drug interaction extraction using majority voting. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 13–15 June 2013; pp. 628–635.
28. Chowdhury, M.F.M.; Lavelli, A. FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 13–15 June 2013; pp. 351–355.
29. Kim, S.; Liu, H.; Yeganova, L.; Wilbur, W.J. Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J. Biomed. Inform. 2015, 55, 23–30.
30. Raihani, A.; Laachfoubi, N. Extracting drug-drug interactions from biomedical text using a feature-based kernel approach. J. Theor. Appl. Inf. Technol. 2016, 92, 109–120.
31. Liu, S.; Tang, B.; Chen, Q.; Wang, X. Drug-drug interaction extraction via convolutional neural networks. Comput. Math. Methods Med. 2016, 2016, 6918381.
32. Zhao, Z.; Yang, Z.; Luo, L.; Lin, H.; Wang, J. Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics 2016, 32, 3444–3453.
33. Quan, C.; Hua, L.; Sun, X.; Bai, W. Multichannel convolutional neural network for biological relation extraction. BioMed Res. Int. 2016, 2016, 1850404.
34. Sun, X.; Dong, K.; Ma, L.; Sutcliffe, R.; He, F.; Chen, S.; Feng, J. Drug-drug interaction extraction via recurrent hybrid convolutional neural networks with an improved focal loss. Entropy 2019, 21, 37.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 8024–8035. [Google Scholar]
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Krallinger, M.; Rabal, O.; Akhondi, S.A.; Pérez Pérez, M.; Santamaría, J.; Pérez Rodríguez, G.; Tsatsaronis, G.; Intxaurrondo, A.; Baso López, J.A.; Nandal, U.K.; et al. Overview of the BioCreative VI Chemical–Protein Interaction Track. In Proceedings of the Sixth BioCreative Challenge Evaluation Workshop, Bethesda, MD, USA, 18–20 October 2017; Volume 1, pp. 141–146. [Google Scholar]
Miranda, A.; Mehryary, F.; Luoma, J.; Pyysalo, S.; Valencia, A.; Krallinger, M. Overview of DrugProt BioCreative VII track: Quality evaluation and large scale text mining of drug-gene/protein relations. In Proceedings of the Seventh BioCreative Challenge Evaluation Workshop, Online, 8–10 November 2021. [Google Scholar]

Figure 1. Overall architecture of MG-ECF. A shared biomedical transformer encoder produces three parallel representational streams, each visually distinguished by color: entity-level ($h_{ent}$, blue), inter-entity contextual ($h_{ctx}$, orange), and global ($h_{cls}$, green). The three streams are projected to a common space and fused by a temperature-scaled gating network (purple), whose per-instance weights $α$ are regularized by an entropy penalty and view-dropout. The fused representation is passed to a two-layer focal-loss classification head.
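
The fusion step in Figure 1 can be illustrated with a minimal, dependency-free sketch. It assumes the gate is a softmax over three per-view logits divided by a temperature, and that the fused vector is the weighted sum of the projected views; the function names, the toy vectors, and the logits are hypothetical, and view-dropout is omitted. The value `tau=4.0` is the best-scoring temperature in Table 6 and is used as a default here only by assumption.

```python
import math

def gate_weights(logits, tau=4.0):
    """Temperature-scaled softmax over per-view gate logits (assumed form)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(views, alphas):
    """Weighted sum of equally sized view vectors."""
    dim = len(views[0])
    return [sum(a * v[i] for a, v in zip(alphas, views)) for i in range(dim)]

def entropy(alphas):
    """Shannon entropy (in nats) of the gate distribution."""
    return -sum(a * math.log(a) for a in alphas if a > 0.0)

# Hypothetical projected views (entity, context, global) and gate logits.
h_ent, h_ctx, h_cls = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
logits = [2.0, 1.0, 0.0]

sharp = gate_weights(logits, tau=1.0)   # low temperature: peaked weights
smooth = gate_weights(logits, tau=4.0)  # high temperature: flatter weights
fused = fuse([h_ent, h_ctx, h_cls], smooth)
```

A larger temperature flattens the weights, so `entropy(smooth)` exceeds `entropy(sharp)`. How the entropy term enters the training loss (sign and weight $λ$) is not specified in the caption, so the sketch only computes the quantity.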


Figure 2. Progress of the state of the art on DDI-2013 ($μ F_{1}$, positive classes only). Each data point corresponds to a system in Table 3 [1,7,8,9,14,16,17,26,28,31,33,34]; $μ F_{1}$ values are taken from the respective publications. MG-ECF (BiomedBERT) surpasses the previous best PLM comparator (BioMCL-DDI) by 2.7 percentage points.


Figure 3. Ablation results on the DDI-2013 test split (four-class $μ F_{1}$, positive classes only). Values are mean ± std over three seeds $\{42, 123, 456\}$, matching the protocol used in Table 4. Mean $μ F_{1}$ increases monotonically with each added component on both backbones, indicating that each of the three representational views and the adaptive gate contributes to the final score.


Figure 4. Per-class $F_{1}$ on the DDI-2013 test set. MG-ECF (BiomedBERT, seed 42) is compared with RHCNN [34], a representative deep learning baseline. RHCNN per-class values are taken directly from Table 4 of Sun et al. [34] (advise 80.54%, effect 73.49%, mechanism 78.25%, int 58.90%). MG-ECF achieves higher $F_{1}$ on all four classes; the rare int class (0.7% of training pairs) remains the largest residual error source across both systems.


Figure 5. Heatmap of mean gate weights per DDI class (MG-ECF, BiomedBERT, seed 42). Entity (E), Context (C), and Global (G) are all engaged; the per-class variation is modest in absolute magnitude (the color scale covers approximately $\pm 0.005$ around the mean) but qualitatively coherent: advise leans on the entity view, while mechanism leans on the context bridge.


Figure 6. Recall-normalized confusion matrix for MG-ECF (BiomedBERT, seed 42) on the DDI-2013 test split (5716 instances). Color encodes row-normalized proportions; raw counts and percentages are annotated inside each cell. The dominant off-diagonal entry is int→effect (30/96, 31.3%), driven by shared surface phrasing and severe int data scarcity.
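
Recall (row) normalization, as used in Figure 6, divides every cell by the number of gold instances in its row. The sketch below is illustrative only: the helper name is hypothetical, and because the full matrix is not reproduced here, it uses just the int row collapsed to two columns (predicted effect versus every other prediction), which follows from the 30/96 figure in the caption.

```python
def row_normalize(matrix):
    """Divide each confusion-matrix row by its row sum (the gold-class support)."""
    normalized = []
    for row in matrix:
        support = sum(row)
        normalized.append([c / support if support else 0.0 for c in row])
    return normalized

# Gold class `int` (96 test pairs): 30 predicted as `effect`, 66 predicted as anything else.
int_row = [30, 96 - 30]
share_effect, share_other = row_normalize([int_row])[0]
# share_effect is 0.3125, which the caption reports as 31.3%.
```

Because each row sums to one after normalization, the diagonal cell of a row is that class's recall, which is why the matrix is described as recall-normalized.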

Figure 7. Two annotated DDI-2013 test instances illustrating the success and failure modes of MG-ECF (BiomedBERT, seed 42): a correctly classified mechanism (Example 1) and an int→effect misclassification (Example 2). Entity spans (DRUG₁ in blue, DRUG₂ in green) and key signal phrases (amber) are highlighted; green border = correct prediction, red = error.

Table 1. Computational cost of MG-ECF vs. the CLS-only BiomedBERT baseline. All measurements were taken on the NVIDIA T4 GPU (16 GB VRAM) with the configuration used throughout. Inference latency is per batch of 32 instances ($L_{\max} = 128$).


Metric	CLS-Only (BiomedBERT)	Full MG-ECF
Trainable parameters	≈110.0 M	114.2 M (+3.8%)
Training time/epoch	≈14.5 min	≈15.1 min (+4%)
Peak GPU memory (train)	≈14.1 GB	≈14.6 GB (+3.5%)
Inference latency (ms/32)	≈330 ms	≈340 ms (+3%)
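
The relative overheads in parentheses in Table 1 can be re-derived from the two absolute columns. This is a small arithmetic check rather than a benchmark; the dictionary keys and the helper name are illustrative, and the inputs are the approximate values printed in the table.

```python
def relative_overhead(baseline, full):
    """Percentage increase of `full` over `baseline`."""
    return 100.0 * (full - baseline) / baseline

# (CLS-only baseline, full MG-ECF) pairs copied from Table 1.
rows = {
    "trainable parameters (M)": (110.0, 114.2),
    "training time per epoch (min)": (14.5, 15.1),
    "peak GPU memory (GB)": (14.1, 14.6),
    "inference latency (ms per 32)": (330.0, 340.0),
}

overheads = {name: relative_overhead(b, f) for name, (b, f) in rows.items()}
for name, pct in overheads.items():
    print(f"{name}: +{pct:.1f}%")
```

All four overheads stay below 5%, consistent with the rounded percentages reported in the table.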

Table 2. DDI-2013 corpus statistics by split and interaction type. Percentages represent the class distribution within each split.

Split	Mechanism	Effect	Advise	Int	False	Total
Train	1163 (4.6%)	1486 (5.9%)	725 (2.9%)	177 (0.7%)	21,745 (86.0%)	25,296
Dev	156 (6.2%)	201 (8.1%)	101 (4.0%)	11 (0.4%)	2027 (81.2%)	2496
Test	302 (5.3%)	360 (6.3%)	221 (3.9%)	96 (1.7%)	4737 (82.9%)	5716
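
The class imbalance in Table 2 is easy to verify from the raw counts. The following sketch recomputes the split totals and per-class percentages; the counts are copied from the table, while the variable and function names are illustrative.

```python
# Pair counts per split and label, copied from Table 2.
counts = {
    "train": {"mechanism": 1163, "effect": 1486, "advise": 725, "int": 177, "false": 21745},
    "dev": {"mechanism": 156, "effect": 201, "advise": 101, "int": 11, "false": 2027},
    "test": {"mechanism": 302, "effect": 360, "advise": 221, "int": 96, "false": 4737},
}

def distribution(split_counts):
    """Return the split total and each label's share of it in percent."""
    total = sum(split_counts.values())
    return total, {label: 100.0 * n / total for label, n in split_counts.items()}

for split, split_counts in counts.items():
    total, pct = distribution(split_counts)
    cells = ", ".join(f"{label} {p:.1f}%" for label, p in pct.items())
    print(f"{split} (n={total}): {cells}")
```

The int class accounts for well under 1% of training pairs, which is the imbalance the focal-loss objective is meant to address.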

Table 3. Test-set performance on DDI-2013 (positive classes only). MG-ECF results (marked *) are mean ± std over three random seeds $\{42, 123, 456\}$. Numbers for prior systems are taken directly from the respective publications. ^‡ Only $μ F_{1}$ is reported in the source paper for this entry; P and R are not separately stated. Best overall values are in bold.


Generation	System	P	R	$μ F_{1}$
ML-based	Segura-Bedmar et al. [1]	0.510	0.728	0.600
	Björne et al. [26]	0.732	0.499	0.594
	Chowdhury and Lavelli [28]	0.646	0.656	0.651
Neural (no PLM)	Liu et al. [31]	0.757	0.647	0.698
	Quan et al. [33]	0.760	0.653	0.702
	Zhang et al. [7]	0.741	0.718	0.729
	Sun et al. [34]	0.773	0.738	0.755
PLM-based	Zhu et al. [9]	0.810	0.809	0.809
	EMSI-BERT [15]	0.832	0.807	0.820
	BERTChem ^‡ [14]	—	—	0.841
	Asada et al. [8]	0.854	0.828	0.841
	Yuan et al. [16]	0.868	0.865	0.866
	Jia et al. [17]	0.881	0.875	0.878
MG-ECF (ours)	+BioBERT *	0.921	0.857	0.888 ± 0.008
MG-ECF (ours)	+BiomedBERT *	0.924	0.888	0.905 ± 0.0004

Table 4. Ablation on the DDI-2013 test split. Results are mean ± std over three seeds $\{42, 123, 456\}$. Best values per backbone are in bold.


Model Variant	BiomedBERT $μ F_{1}$	BiomedBERT Macro-$F_{1}$	BioBERT $μ F_{1}$	BioBERT Macro-$F_{1}$
CLS-only (standard fine-tuning)	0.873 ± 0.002	0.814	0.857 ± 0.006	0.803
Entity-only (markers, no context)	0.882 ± 0.002	0.825	0.869 ± 0.003	0.821
w/o Context Bridge (Entity + Global)	0.895 ± 0.001	0.840	0.878 ± 0.002	0.829
Concat fusion (no gate)	0.899 ± 0.001	0.849	0.882 ± 0.002	0.831
Full MG-ECF	0.905 ± 0.0004	0.861	0.888 ± 0.008	0.834
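
To make the step-by-step effect of each component in Table 4 explicit, the sketch below computes the gain in mean $μ F_{1}$ between consecutive rows. The scores are copied from the table; the shortened variant labels and the helper name are illustrative, and treating the rows as a strictly cumulative sequence is a simplifying assumption.

```python
# Mean micro-F1 per variant, copied from Table 4 in row order.
variants = ["CLS-only", "Entity-only", "Entity + Global", "Concat fusion", "Full MG-ECF"]
micro_f1 = {
    "BiomedBERT": [0.873, 0.882, 0.895, 0.899, 0.905],
    "BioBERT": [0.857, 0.869, 0.878, 0.882, 0.888],
}

def stepwise_gains(scores):
    """Difference between each row and the row above it, rounded to 3 decimals."""
    return [round(b - a, 3) for a, b in zip(scores, scores[1:])]

gains = {name: stepwise_gains(scores) for name, scores in micro_f1.items()}
monotonic = {name: all(g > 0 for g in gs) for name, gs in gains.items()}

for name, gs in gains.items():
    steps = ", ".join(f"{v}: +{g:.3f}" for v, g in zip(variants[1:], gs))
    print(f"{name}: {steps}")
```

Every step is positive on both backbones. The last BioBERT step (+0.006, from concat fusion to the gated model) is smaller than the ±0.008 standard deviation reported for the full model, so that particular gain is best read with caution.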

Table 5. Per-class performance of MG-ECF (seed 42) on the official DDI-2013 test split for both backbone variants shown side by side. Support is the number of positive test pairs per class. Both columns report a single representative seed; three-seed means are BiomedBERT $μ F_{1} = 0.905 \pm 0.0004$ and BioBERT $μ F_{1} = 0.888 \pm 0.008$ (Table 4).


Class	BiomedBERT P	BiomedBERT R	BiomedBERT $F_{1}$	BioBERT P	BioBERT R	BioBERT $F_{1}$	Support
mechanism	0.975	0.920	0.947	0.967	0.873	0.918	302
effect	0.887	0.894	0.891	0.872	0.898	0.885	360
advise	0.939	0.977	0.958	0.933	0.943	0.938	221
int	0.879	0.531	0.662	0.918	0.476	0.626	96
micro avg	0.926	0.886	0.906	0.921	0.857	0.888	979
macro avg	0.920	0.831	0.864	0.922	0.797	0.834	979
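
The relationship between the columns of Table 5 can be checked with the standard definitions: $F_{1}$ is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the four positive classes. The sketch below does this for the BiomedBERT columns only; the names are illustrative, and because the inputs are already rounded to three decimals, agreement with the printed values is expected only to within about 0.001.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# BiomedBERT, seed 42: (P, R, reported F1) copied from Table 5.
per_class = {
    "mechanism": (0.975, 0.920, 0.947),
    "effect": (0.887, 0.894, 0.891),
    "advise": (0.939, 0.977, 0.958),
    "int": (0.879, 0.531, 0.662),
}

recomputed = {cls: f1(p, r) for cls, (p, r, _) in per_class.items()}
macro_f1 = sum(reported for _, _, reported in per_class.values()) / len(per_class)
micro_f1 = f1(0.926, 0.886)  # from the micro-averaged P and R row

for cls, value in recomputed.items():
    print(f"{cls}: recomputed {value:.4f}, reported {per_class[cls][2]:.3f}")
```

The int row shows why macro-$F_{1}$ sits well below $μ F_{1}$: its recall of 0.531 pulls down the unweighted mean, while its small support (96 of 979 positive pairs) limits its effect on the micro average.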

Table 6. Hyperparameter sensitivity on the DDI-2013 test set (BiomedBERT, seed 42). Default values are underlined. $μ F_{1}$ is reported. All values are from a single seed; differences smaller than $\pm 0.005$ may fall within plausible seed variance.


Hyperparameter	Value	$μ F_{1}$
Gate temperature $τ$	1.0	0.894
	2.0	0.899
	4.0	0.906
	8.0	0.901
Focal $γ$	0.0 (CE)	0.878
	1.0	0.895
	2.0	0.906
	3.0	0.901
Entropy weight $λ$	0.0	0.899
	0.3	0.906
	0.5	0.903
	1.0	0.897
View dropout $p_{view}$	0.0	0.897
	0.1	0.902
	0.3	0.906
	0.5	0.898
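
The focal $γ$ rows in Table 6 label $γ = 0$ as cross-entropy (CE). A minimal sketch of the per-example focal loss of Lin et al. shows why: the modulating factor $(1 - p_t)^{γ}$ equals 1 at $γ = 0$ and shrinks the loss of confidently correct examples as $γ$ grows. The function names and the two example probabilities are hypothetical, and any per-class weighting term the authors may use is omitted.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of its gold class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss without class weights: (1 - p_t)^gamma scales the CE term."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)

easy, hard = 0.95, 0.30  # hypothetical gold-class probabilities

# Ratio of focal loss to cross-entropy equals (1 - p_t)^gamma.
ratio_easy = focal_loss(easy) / cross_entropy(easy)  # about 0.0025
ratio_hard = focal_loss(hard) / cross_entropy(hard)  # about 0.49
```

With $γ = 2$, the well-classified example keeps only about 0.25% of its cross-entropy loss while the hard example keeps about 49%, so training gradient concentrates on hard cases such as rare-class pairs.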

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chanaa, H.; Chakir, L.; Nfaoui, E.H. MG-ECF: Multi-Granularity Entity-Context Fusion for Drug–Drug Interaction Extraction. Future Internet 2026, 18, 289. https://doi.org/10.3390/fi18060289

