1. Introduction
Drug–drug interaction (DDI) detection and classification from biomedical text is a critical task in clinical pharmacovigilance [
1,
2]. Adverse DDIs are among the leading causes of preventable medication-related complications, contributing to hospital admissions, adverse drug events, and clinical decision-making errors [
3,
4]. As biomedical knowledge continues to expand through Internet-accessible repositories, online literature databases, and networked health platforms, automated DDI extraction has become an essential component of large-scale drug safety surveillance and cloud-based clinical decision support systems. Advances in deep learning and domain-specific pre-trained language models (PLMs) such as BioBERT [
5] and BiomedBERT [
6] have substantially improved DDI classification, and the DDI-2013 shared-task corpus [
1,
2] has become the de facto benchmark for comparing these systems. Despite this progress, DDI classification remains challenging.
Two structural difficulties persist. First, the DDI-2013 corpus is severely class-imbalanced: negative (non-interacting) pairs account for more than
of instances [
7,
8], which biases gradient-based optimization toward the majority class. Second, the semantic distinction between interaction types is subtle and context-dependent: distinguishing a pharmacokinetic
mechanism from a clinical
effect or prescriptive
advise often requires capturing evidence that is syntactically distant from the target drug entities [
9,
10]. Most existing PLM-based approaches address these challenges by fine-tuning on a single representation source, typically the
[CLS] token [
11] or entity-marker embeddings [
12,
13], which collapses the sentence into one view and discards the complementary evidence distributed across entity, inter-entity, and global contextual regions.
Prior work has explored enriching PLM representations with entity-aware attention and external drug information [
8,
9,
14] and deep recurrent architectures with attention mechanisms [
10], yet none of these lines of work jointly models all three granularity levels (entity-specific, inter-entity contextual, and global sentence-level) within a single adaptive fusion framework. Comparisons across these studies are further complicated by inconsistent experimental protocols: some works merge the official training and development sets while others do not, making it difficult to isolate architectural gains from data-preparation choices [
1].
Several recent DDI extraction systems have pursued complementary strategies. Zhu et al. [
9] combine entity-aware attention layers over BioBERT and BiGRU outputs, yet do not isolate entity spans and inter-entity context as separately pooled views from a single shared encoder. EMSI-BERT [
15] constructs multiple symbol-inserted input variants and extracts a
[CLS] representation per variant, fusing at the input level with no adaptive per-instance gate. BioFocal-DDI [
16] combines BioGPT data augmentation with a BioBERT–BiLSTM–ReGCN pipeline, but does not extract dedicated entity-span or inter-entity context views as separate streams. BioMCL-DDI [
17], the current strongest baseline (
), applies meta-contrastive learning to a single
[CLS] vector without separately pooling entity spans or inter-entity context. To the best of our knowledge, MG-ECF is the first DDI extraction system that simultaneously (i) extracts entity-span, inter-entity context, and global
[CLS] representations from a single shared biomedical encoder, (ii) fuses them through a temperature-scaled, entropy-regularized gate whose weights are computed per instance, and (iii) combines this multi-granularity fusion with view-dropout regularization in a single unified training objective. A full system-by-system comparison is provided in
Section 2.
We ask whether explicitly extracting and adaptively fusing representations at three complementary granularity levels yields more discriminative DDI classification than any single-view baseline. To answer this question, we propose MG-ECF (Multi-Granularity Entity-Context Fusion), a framework that extracts three parallel feature streams from a shared biomedical transformer encoder and combines them through a temperature-scaled gating mechanism [
18,
19] regularized by view-dropout [
20] and an entropy penalty. We evaluate MG-ECF on the DDI-2013 benchmark under the official train/dev/test protocol [
1] using both BioBERT and BiomedBERT, with multi-seed experiments to ensure statistically reliable estimates [
21,
22].
The remainder of this paper is organized as follows.
Section 2 surveys related work;
Section 3 presents the proposed methodology;
Section 4 reports experimental results and ablation analyses;
Section 5 discusses model behavior and limitations; and
Section 6 concludes with directions for future work.
The main contributions of this work are as follows:
1. We propose MG-ECF, a multi-granularity fusion architecture that jointly integrates entity-level, inter-entity contextual, and global sentence representations for biomedical relation classification.
2. We introduce an adaptive temperature-scaled gating mechanism with view-dropout regularization, enabling robust, instance-aware feature fusion.
3. We conduct an extensive empirical evaluation on DDI-2013 using BioBERT and BiomedBERT, including multi-seed experiments and ablation studies that quantify the contribution of each representational component.
4. We employ a class-balanced focal loss objective to directly address the severe label imbalance inherent in the DDI-2013 corpus, where the negative class constitutes the large majority of training pairs. By down-weighting easy majority-class examples and focusing gradient updates on hard minority samples, this strategy enables more reliable learning across the clinically significant but under-represented interaction types.
4. Results
4.1. Dataset and Evaluation Protocol
We evaluated MG-ECF on the DDI-2013 corpus [
1,
2], the de facto benchmark for drug–drug interaction extraction. The corpus aggregates drug descriptions from DrugBank and MEDLINE abstracts, annotated at the candidate-pair level with five labels:
mechanism,
effect,
advise,
int, and the negative class
false. We used the official training, development, and test splits without any additional pre-processing beyond the entity-marker insertion described in
Section 3. The resulting sample counts are given in
Table 2. The distribution is strongly skewed toward the negative class, which accounts for
of the training pairs, and the
int class is particularly rare (≈0.7% of training pairs and only 96 test pairs).
Table 2 highlights the long-tail label regime that motivates our imbalance-aware objective and the class-specific analyses reported later.
Following the official DDI-2013 protocol, we report micro-averaged precision (P), recall (R), and F1 score (F1) computed over the four positive classes only; the false class is excluded from the aggregate, as is standard [1]. We additionally report the macro-F1 to expose the influence of rare classes, and per-class F1 for a fine-grained view.
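The scoring convention described above can be made concrete with a short sketch. This is a minimal illustration and not the official shared-task scorer: the function names `prf` and `ddi_scores` are hypothetical, and labels are assumed to be plain strings matching the five corpus labels.

```python
from collections import Counter

POSITIVE = ("mechanism", "effect", "advise", "int")  # "false" is excluded


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 where undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def ddi_scores(gold, pred):
    """Micro (P, R, F1), macro-F1 and per-class F1 over the positive classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in POSITIVE:
                tp[g] += 1
        else:
            if p in POSITIVE:  # predicted a positive class wrongly
                fp[p] += 1
            if g in POSITIVE:  # missed a gold positive instance
                fn[g] += 1
    per_class = {c: prf(tp[c], fp[c], fn[c])[2] for c in POSITIVE}
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro_f1 = sum(per_class.values()) / len(POSITIVE)
    return micro, macro_f1, per_class
```

Note that a negative pair predicted as a positive class still counts as a false positive for that class, so excluding false from the aggregate does not hide over-prediction.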
4.2. Experimental Setup
This subsection describes the experimental design: backbone selection, training configuration, and evaluation protocol. Software and hardware specifics are covered in
Section 3.8.
We instantiated MG-ECF on two domain-specific pre-trained language models: BioBERT (
dmis-lab/biobert-base-cased-v1.2) [
5] and BiomedBERT/PubMedBERT (
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) [
6]. Both are base-size models (hidden size 768, 12 layers, 12 attention heads, ≈110 M parameters).
We trained with AdamW [
37,
40] using a linear warm-up schedule (warm-up ratio 0.1) and a per-step batch size of 16 with two-step gradient accumulation (effective batch size 32). All hyperparameters were fixed: weight decay 0.01, maximum sequence length 128, learning rate
, up to 12 epochs with early stopping on development
(patience 5–8), focal loss [
35] with
and inverse-frequency class weights
, gate temperature
, view-dropout
, and entropy regularizer
. These values were selected by grid search on the official development set prior to any test-set evaluation; the search grids were
,
,
, and
(full sensitivity reported in
Section 4.6). The three random seeds
were fixed before any experiment and no seed was selected or filtered based on test-set performance; all MG-ECF results are reported as mean ± standard deviation, following Reimers and Gurevych [
21] and Dodge et al. [
22].
For every run, the checkpoint with the highest development F1 over the four positive classes was used at test time. No hyperparameter was tuned on the test set.
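To illustrate the imbalance-aware objective named in the setup, the following is a minimal per-example sketch of a focal loss with inverse-frequency class weights. It is not the training code of this study: the helper names are hypothetical, the weights are normalized to mean 1 as one common convention, and `gamma=2.0` is a placeholder rather than the value selected in our grid search.

```python
import math


def inverse_frequency_weights(counts):
    """Class weights proportional to 1/count, rescaled to have mean 1."""
    inv = [1.0 / n for n in counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, target, alpha, gamma=2.0):
    """Class-weighted focal loss for one example.

    gamma=2.0 is a placeholder value; gamma=0 recovers weighted cross-entropy.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor `(1 - p_t) ** gamma` shrinks the loss of confidently correct examples, which on this corpus are mostly negative pairs, while the class weights raise the contribution of rare labels such as int.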
4.3. Overall Performance and Comparison with the State of the Art
Table 3 compares MG-ECF against representative systems spanning three methodological generations. Numbers for prior systems are taken from the corresponding papers; all entries use the official DDI-2013 test split and the four-class F1 metric, so the comparison is like-for-like with respect to test data and metric. Prior-system results are single-point estimates: none of the baseline publications reports multi-seed variance, so standard deviations cannot be included for those entries. Statistical significance of the MG-ECF improvement is instead established through the one-sample
t-test reported below.
Figure 2 situates our result within the historical trajectory of DDI-2013 systems and makes the generation-level performance trend explicit.
MG-ECF with BiomedBERT achieved a micro-F1 of 90.55%, surpassing the strongest published PLM comparator in
Table 3 (BioMCL-DDI,
) by 2.7 absolute percentage points, and the best non-pretrained neural system [
7] by over 15 points (
Table 3). The negligible standard deviation (
) across three independent seeds confirms that the improvement is stable and not seed-dependent. To quantify statistical reliability: the
percentage-point improvement over BioMCL-DDI (
) is
larger than the full three-seed variation band (
). A one-sample
t-test against the null hypothesis
yields
,
, indicating that the observed gain is unlikely to be attributable to random initialization alone. The BioBERT variant reached a micro-F1 of 88.8%, itself above all prior published results, which suggests that the gains stem from the MG-ECF architecture rather than from the BiomedBERT pre-training corpus.
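For readers who wish to reproduce this kind of test, a one-sample t-test over three seed scores can be sketched with the standard library alone. The function names are hypothetical and the scores in the usage lines are toy values, not results from this study. The closed-form tail probability below is specific to two degrees of freedom, which is the case for three seeds.

```python
import math
from statistics import mean, stdev


def one_sample_t(scores, mu0):
    """t statistic for H0: mean(scores) == mu0, with len(scores) - 1 d.o.f."""
    n = len(scores)
    return (mean(scores) - mu0) / (stdev(scores) / math.sqrt(n))


def p_one_sided_df2(t):
    """Exact upper-tail p-value of Student's t with 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    """
    return 0.5 - t / (2.0 * math.sqrt(t * t + 2.0))


# Toy values for illustration only (not taken from the paper's tables).
toy_scores = [0.80, 0.81, 0.82]
toy_baseline = 0.78
t_stat = one_sample_t(toy_scores, toy_baseline)
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided_df2(t_stat):.4f}")
```

With only three observations the test has two degrees of freedom, so the statistic is large mainly when the seed-to-seed spread is very small relative to the gap; the function is undefined if all seeds give identical scores.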
To separate the contribution of the backbone from the contribution of the fusion module,
Table 4 (
Section 4.4) includes a
CLS-only BiomedBERT baseline (standard single-vector fine-tuning) that achieves
. This baseline falls just 0.5 points short of the strongest prior PLM-based result (BioMCL-DDI,
), confirming that BiomedBERT is itself a competitive backbone; the multi-granularity fusion then adds an additional
points on top to establish the new state of the art.
4.4. Ablation Study
Table 4 presents an ablation study in which architectural components are progressively added to MG-ECF, with all variants evaluated on both backbones. For both backbones, all five variants are reported as mean ± std over three seeds
. The five variants are: (i)
CLS-only, standard single-vector fine-tuning; (ii)
Entity-only, entity-marker pooling without a context bridge or gate; (iii)
w/o Context Bridge, entity and global views fused without the inter-entity context stream; (iv)
Concat fusion, all three views combined with a fixed linear head (no gate); and (v)
Full MG-ECF with the temperature-scaled gate.
Figure 3 visualizes the same monotonic improvement pattern reported in
Table 4 for both backbones.
All three views contributed non-redundantly on BiomedBERT. The largest single drop came from removing all specialized structures (
points,
CLS-only vs. full), confirming that standard single-vector fine-tuning under-uses the relational structure of DDI sentences. Adding entity markers recovered
points (
Entity-only), consistent with the established benefit of marker-based representations [
12,
13]. Adding the global sentence representation contributed a further
points (
Entity-only vs.
w/o Context Bridge), confirming that long-range sentence context not captured by entity spans alone is necessary for type disambiguation. The inter-entity context bridge then added
points (
w/o Context Bridge vs.
Concat fusion), showing that the lexical material between the two drug spans carries relational evidence not captured by entity markers or the
[CLS] token alone. Finally, replacing the soft gate with a fixed concatenation head (
Concat fusion) lost
points, demonstrating that dynamic per-instance weighting outperforms a fixed linear combination. The BioBERT boundary rows confirm that the net architectural effect (
points, CLS-only to full) is not specific to BiomedBERT’s broader pre-training corpus.
The individual contribution of each regularization component is formally quantified in the hyperparameter sweep of
Section 4.6, which provides a dedicated per-design-choice ablation: disabling entropy regularization (
) costs
points, deactivating view-dropout (
) costs
points, and using a sharp gate (
versus the default
) costs
points. Together with the architectural ablation in
Table 4, these results confirm that both the three-view design and each of the three novel regularization choices are independently necessary for the full performance of MG-ECF.
The full MG-ECF configuration started near the CLS-only level at epoch 1 (gate near-uniform at initialization) and progressively overtook all partial variants as the entity and context streams specialized, confirming that the improvement accumulates during training rather than appearing only at the final checkpoint.
4.5. Per-Class Analysis
Table 5 reports per-class performance for both backbone variants (BiomedBERT and BioBERT, seed 42) side by side. For BiomedBERT, the model reached
on two of the four positive classes, with particularly strong performance on the safety-critical
advise class. The
int class remained the dominant error source: the decomposition
/
indicates that MG-ECF was
precise on
int but achieved low recall, consistent with the class being severely under-represented in training (177 instances, ≈0.7% of the training set). This pattern has been widely reported for DDI-2013 systems across methodological generations [
8,
28] and is analyzed further in
Section 4.8 and
Section 5.
Comparing the BiomedBERT and BioBERT columns of
Table 5 reveals that the 1.8 pp micro-F1 gap between the two backbones (seed 42: 0.906 vs. 0.888) is distributed across all four classes. The per-class decomposition is:
mechanism −2.9 pp,
advise pp,
effect pp, and
int pp. The
int class shows the largest absolute gap (
vs.
), with BioBERT’s recall falling to
compared with BiomedBERT’s
. Both backbones exhibit the same high-precision/low-recall pattern on
int (BioBERT:
,
; BiomedBERT:
,
), suggesting that the bottleneck is driven by training-data scarcity (177 instances, ≈0.7% of training pairs) rather than by the choice of backbone pre-training corpus.
Figure 4 complements
Table 5 by making the class-wise margin over representative baselines immediately comparable.
4.6. Hyperparameter Sensitivity
Table 6 reports the sensitivity of MG-ECF (BiomedBERT, seed 42) to four key hyperparameters: gate temperature
, focal loss
, entropy regularization weight
, and view-dropout rate
. Each hyperparameter was varied over four values while all others were held at their defaults.
Three patterns emerged from the sweep. First, a sharper gate () hurt performance by points relative to the default : a lower temperature concentrates gate mass prematurely on one view during early training, making the entropy regularizer less effective. Second, focal loss () was the single most impactful individual choice, outperforming standard cross-entropy () by points, confirming the importance of down-weighting the overwhelming negative class on this severely imbalanced corpus. Third, both the entropy coefficient () and view dropout () provided consistent gains of approximately – points over their deactivated baselines (, ), and degraded gracefully as their values were pushed beyond the optimum. Overall, MG-ECF showed low sensitivity across reasonable ranges of all four hyperparameters, with the default configuration achieving the best result in every sweep. The following section examines the learned gate weights to provide a mechanistic interpretation of how the three views are exploited per class.
4.7. Analysis of the Learned Gating
We examined the weights assigned by the temperature-scaled gate to the three views after training converged. Averaged across all test instances, the gate distributed attention nearly uniformly: entity (
), context bridge (
), and global (
). This near-uniform average is by design: the combination of
,
, and
actively prevents the gate from collapsing onto a single stream. Crucially, the ablation results in
Table 4 show that physically removing any view consistently reduced performance, providing direct evidence that all three streams were genuinely exploited and not redundant.
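The gating behavior discussed here can be sketched in a few lines. This is a simplified illustration under stated assumptions and not the exact MG-ECF layer: view-dropout is modeled as masking a view's gate logit and renormalizing (the method section may define it differently), the function names are hypothetical, and any temperature or dropout values passed in are placeholders.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gate_weights(logits, tau, keep=None):
    """Temperature-scaled softmax over per-instance view logits.

    keep: optional list of booleans; masked views get zero weight and the
    remaining weights are renormalized (an assumed form of view-dropout).
    At least one entry of keep must be True.
    """
    scaled = [z / tau for z in logits]
    if keep is not None:
        scaled = [s if k else -math.inf for s, k in zip(scaled, keep)]
    return softmax(scaled)


def sample_keep_mask(n_views, p_drop, rng):
    """Drop each view independently with probability p_drop, keeping >= 1."""
    keep = [rng.random() >= p_drop for _ in range(n_views)]
    if not any(keep):
        keep[rng.randrange(n_views)] = True
    return keep


def fuse(views, weights):
    """Weighted sum of equally sized view vectors."""
    dim = len(views[0])
    return [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]
```

A larger `tau` flattens the weights toward uniform, while masking a view during training forces the classifier to cope without it, which is one way to discourage reliance on a single stream.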
Per-class gate inspection revealed qualitatively coherent specialization, despite the modest absolute magnitude of the shifts (see
Figure 5).
Advise predictions, characterized by explicit regulatory phrases such as “should not be used with” or “is contraindicated with”, assigned a relatively higher weight to the entity view, consistent with advisory interactions often depending on the identity of the specific drug pair involved. In contrast,
mechanism predictions weighted the context bridge more strongly, consistent with mechanistic interactions being signalled by pharmacokinetic language between the two drug mentions (e.g., “inhibits the CYP3A4 metabolism of”). These patterns were consistent across seeds but small in absolute magnitude, so their primary value is qualitative.
The gate weight shifts, while modest in absolute size, are mechanistically coherent and consistent with the syntactic structure of DDI sentences. Inter-entity context dominance for
mechanism: pharmacokinetic trigger phrases such as “inhibits the CYP3A4 metabolism of” or “displaces from plasma protein binding” appear precisely in the inter-entity span; the context bridge aggregates exactly these tokens, explaining why removing it (w/o Context Bridge in
Table 4) causes the largest per-class drop on mechanism predictions. Entity view dominance for
advise: advisory interactions are often determined by the pharmacological class membership of the specific drug pair (e.g., a contraindication holds for all drugs of a particular class); the entity view encodes what the drugs are through their span representations, providing the identity signal that the global
[CLS] context alone cannot. Global view stabilization for
effect and
int: the broad discourse framing (“this combination may produce…”, “patients receiving both…”) is distributed across the whole sentence and is best captured by the global representation; the gate appropriately assigns it a relatively higher weight for these classes, where predicate-argument structure at the sentence level provides the decisive evidence.
At the default
, the mean gate entropy across all test instances is
bits, near the maximum entropy for a 3-way distribution (log2 3 ≈ 1.585 bits), confirming that the gate distributes weight broadly and does not collapse onto a single view. At the sharper
, the mean entropy drops to
bits: the gate becomes quasi-one-hot, effectively ignoring two of the three views for most instances, which explains the
-point performance drop in
Table 6. The entropy regularizer (
) provides complementary protection during early training: without it (
), the gate entropy at epoch 3 is approximately
lower than with it, consistent with the task gradient not yet being strong enough to prevent premature view collapse, and with the
-point gap in
Table 6. Together, temperature softening and entropy regularization ensure that all three views are exploited throughout training, with the gate becoming progressively more instance-specific as training progresses.
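The relationship between gate temperature and gate entropy can be checked numerically. The sketch below is illustrative only: the logits and temperatures are hypothetical values, not the settings or measurements reported in Table 6, and the function names are our own.

```python
import math


def entropy_bits(weights):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(w * math.log2(w) for w in weights if w > 0.0)


def softmax_t(logits, tau):
    """Softmax of logits / tau, computed stably."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


MAX_BITS = math.log2(3)  # upper bound for three views, about 1.585 bits

logits = [1.0, 0.2, -0.5]  # hypothetical gate logits for one instance
for tau in (0.25, 1.0, 4.0):  # hypothetical temperatures
    h = entropy_bits(softmax_t(logits, tau))
    print(f"tau={tau}: H={h:.3f} bits ({h / MAX_BITS:.0%} of maximum)")
```

For fixed logits the entropy rises with the temperature and approaches the three-view maximum, which is the softening effect described above. An entropy regularizer would typically enter the loss as a term that rewards higher entropy, although its exact form and sign convention are those of the method section, not of this sketch.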
The gate entropy trajectory over training epochs provides further quantitative evidence. At epoch 1 (near-initialization), gate logits are close to zero and both configurations (
and
) operate near maximum entropy (
bits). By epoch 3, the two configurations diverge: with
,
bits (92% of maximum), while without regularization (
),
bits (65% of maximum), indicating incipient view collapse. At convergence (≈epochs 8–10), entropy stabilizes at
bits with regularization and
bits without it; the partial collapse in the unregularized model is consistent with the
-point performance deficit reported in
Table 6.
4.8. Error Analysis
Analysis of the 142 misclassified test instances (BiomedBERT, seed 42) revealed three systematic patterns: (i) int/effect confusion, driven by shared surface phrasing and the extreme scarcity of int training data; (ii) long-range context bridge dilution, where drug pairs separated by more than 60 tokens reduce the mean-pooled context signal; and (iii) false-positive predictions (≈21% of errors) from incidental co-administration language that superficially resembles an effect trigger. These patterns motivate span-level attention for long-range cases, targeted data augmentation for int, and negation-aware pre-processing.
Figure 6 reports the recall-normalized confusion matrix for MG-ECF (BiomedBERT, seed 42) on the official test split.
The dominant off-diagonal pattern is the int→effect confusion (30 out of 96 int instances, a miss-rate of ≈31%): the 177 int training instances are insufficient for the 110 M parameter encoder to reliably distinguish laconic interaction mentions from the semantically adjacent effect class, which shares surface-level outcome language (“leading to increased serum concentrations”, “may result in altered plasma levels”). The mechanism class shows the highest precision: pharmacokinetic trigger phrases (e.g., “inhibitor of cytochrome P450 3A4”) are syntactically distinctive and captured with high specificity by the inter-entity context view. The advise class achieves the highest recall: modal regulatory phrases (“should not be used with”, “is contraindicated with”) are highly predictive surface cues that the model learns to associate reliably with the advisory label. The 30 false-positive predictions (≈21% of all 142 errors), that is, false instances misclassified into one of the four positive classes (advise: 14, effect: 9, mechanism: 5, int: 2), arise from co-administration sentences that report clinical outcomes without expressing a true pharmacological interaction.
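As a reading aid for the recall-normalized matrix, the following sketch shows how such a matrix is derived from gold and predicted labels. The function names are hypothetical and the code is not the plotting script behind Figure 6; only the int→effect count quoted in the final lines comes from the analysis above.

```python
LABELS = ("mechanism", "effect", "advise", "int", "false")


def confusion_counts(gold, pred):
    """counts[g][p] = number of instances with gold label g predicted as p."""
    counts = {g: {p: 0 for p in LABELS} for g in LABELS}
    for g, p in zip(gold, pred):
        counts[g][p] += 1
    return counts


def recall_normalize(counts):
    """Divide each gold row by its total, so the diagonal is per-class recall."""
    out = {}
    for g, row in counts.items():
        total = sum(row.values())
        out[g] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return out


# The int -> effect cell reported in the text: 30 of the 96 int test pairs.
int_to_effect = 30 / 96
print(f"int -> effect share of the int row: {int_to_effect:.2f}")
```

Because each row is divided by the number of gold instances of that class, rare classes such as int are displayed on the same scale as frequent ones, which is why the int row stands out despite its small absolute counts.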
Figure 7 presents two annotated DDI-2013 test instances that illustrate the main success and failure modes. Example 1 (
DDI-DrugBank.d580.s1.p0) is correctly classified as
mechanism: the inter-entity phrase “inhibitor of cytochrome P450 3A4” is a pharmacokinetic trigger captured with high specificity by the context bridge. Example 2 (
DDI-DrugBank.d709.s1.p7) is an
int→
effect error: the outcome phrase “leading to increased serum concentrations” activates the
effect prototype, and the model’s insufficient
int training signal (177 instances, ≈0.7% of training pairs) cannot override this surface cue.
5. Discussion
We found that explicitly extracting and adaptively fusing representations at three complementary granularity levels, namely entity-level, inter-entity contextual, and global sentence-level, yields more discriminative DDI classification than any single-view baseline, establishing a new state of the art on DDI-2013 with both BioBERT and BiomedBERT backbones. Concretely, MG-ECF reaches
a micro-F1 of 90.55% with BiomedBERT and 88.8% with BioBERT, corresponding to a 2.7 absolute-point gain over the strongest previously reported PLM-based system, BioMCL-DDI (
Table 3). The following subsections interpret this result mechanistically, contrast MG-ECF with alternative paradigms, examine the residual error on the rare
int class, and assess the practical and validity implications of our conclusions.
5.1. Interpreting MG-ECF’s Gains
Our ablation (
Table 4) shows that replacing a
[CLS]-only head with the full three-view adaptive fusion improves
by
points on BiomedBERT and
points on BioBERT, consistently across three seeds. The three granularities capture genuinely complementary evidence. The entity view encodes what the drugs are, the inter-entity context view captures how they are related linguistically, and the global
[CLS] view encodes the overall discourse framing.
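The three views can be illustrated on plain token vectors. This is a simplified sketch and not the MG-ECF implementation: it assumes mean pooling over each drug span, concatenation of the two span means as the entity view, and a zero vector when the two mentions are adjacent; the function names and span convention are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def extract_views(hidden, e1, e2):
    """Return (entity, context, global) views from encoder outputs.

    hidden: list of token vectors, with hidden[0] at the [CLS] position.
    e1, e2: (start, end) token spans of the two drugs, end exclusive,
            with e1 preceding e2 in the sentence.
    """
    dim = len(hidden[0])
    # Entity view: concatenation of the two mean-pooled drug spans.
    entity = mean_pool(hidden[e1[0]:e1[1]]) + mean_pool(hidden[e2[0]:e2[1]])
    # Context view: mean over the tokens strictly between the two mentions.
    between = hidden[e1[1]:e2[0]]
    context = mean_pool(between) if between else [0.0] * dim
    # Global view: the [CLS] vector.
    return entity, context, hidden[0]
```

Because the context view is an unweighted mean, a single trigger token among k intervening tokens contributes only 1/k of the pooled vector, which is consistent with the dilution observed for widely separated drug pairs.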
A naive fusion without temperature scaling collapsed rapidly onto the global view in our preliminary experiments. The temperature
, entropy regularizer
, and view-dropout
jointly prevent this collapse: the monotonic ablation improvement (
Table 4) and the class-stratified gate weights (
Figure 5) confirm that all three streams are genuinely exploited and not redundant.
5.2. The Int Bottleneck
Despite the state-of-the-art aggregate numbers, the
int class remains the dominant residual error source (
on BiomedBERT;
Table 5). MG-ECF achieves high precision (
) but poor recall (
): the 177 training instances are simply too few for a 110 M parameter encoder to reliably separate laconic mentions such as “
X interacts with
Y” from the
false class. Closing this gap will likely require targeted data augmentation or distant supervision from DrugBank, rather than architectural changes alone.
5.3. Threats to Validity and Limitations
We note the main limitations of our study. (i) Single corpus: all experiments use DDI-2013; generalization to BC7 DrugProt or cross-lingual settings is untested. (ii) Single language: the corpus is English-only. (iii) Hyperparameter search: all hyperparameters were selected on the official development split; sensitivity to the search procedure is not separately measured. (iv) Seed budget: results use three seeds; the BiomedBERT standard deviation (<0.001) is already low, but a larger budget would tighten confidence intervals further. (v) Prior-work reporting: numbers for prior systems are taken directly from their papers; minor differences in implementation or protocol may account for a small fraction of the 2.7-point gap, though its magnitude and cross-backbone consistency argue against an artefact explanation. None of these limitations undermines the central finding that multi-granularity fusion with a regularized gate is an effective add-on to biomedical PLM encoders for DDI extraction.
A key open question is whether the multi-granularity fusion approach generalizes beyond DDI-2013. We discuss four relevant scenarios.
ChemProt and BC7 DrugProt. ChemProt [
41] and the BC7 DrugProt track [
42] address chemical–gene and drug–gene relation extraction, respectively. Both use the same sentence-level candidate-pair format as DDI-2013, so the entity marker insertion, inter-entity context window, and
[CLS] pooling operations in MG-ECF are directly applicable with no architectural change beyond label-layer replacement. The relative contribution of the inter-entity context view may shift, however, as ChemProt sentences tend to be longer and syntactically more complex; the span-average pooling we use for the context bridge may require replacement by a learned attention mechanism in these settings. Empirical evaluation on these corpora is our most immediate planned extension.
Clinical narratives (MIMIC-III, n2c2). Clinical notes differ substantially from PubMed abstracts in register (telegraphic, with implicit subject entities), vocabulary (brand names, dosing instructions, abbreviations), and noise level. The BioBERT and BiomedBERT backbones are pre-trained on biomedical literature from PubMed and PMC rather than on clinical notes; significant domain shift is therefore expected when applying these encoders to clinical text without further adaptation. Substituting a ClinicalBERT or Bio+ClinicalBERT backbone would be the natural first step before evaluation on clinical corpora, and the multi-granularity fusion layer would remain unchanged.
Cross-lingual settings. DDI-2013 is English-only; extension to the Spanish DDI corpus or multilingual biomedical benchmarks would require a multilingual encoder (mBERT, XLM-R-BioMed) or language-specific pre-training. The architecture itself is language-agnostic beyond the encoder, making such extensions straightforward in principle.
Drug-label and EHR corpora (TAC-DDI, real-world EHR text). Drug-label-derived corpora and de-identified clinical notes represent challenging domain-shift scenarios: regulatory language differs substantially from PubMed prose, and clinical notes are telegraphic and noisy. Without domain adaptation, we expect reduced performance, as the BioBERT and BiomedBERT backbones are pre-trained on PubMed and PMC text only. A transfer experiment on these corpora falls outside the scope of the present study and is identified as our most immediate planned extension alongside ChemProt/BC7 DrugProt.
6. Conclusions
Relation classification in biomedical text is fundamentally a multi-signal problem: determining how two drugs interact requires understanding what they are, how they are linguistically connected, and what the surrounding context implies. However, most existing approaches collapse this rich structure into a single representation, limiting their ability to capture complementary information. MG-ECF addresses this limitation by extracting three complementary representations, entity-level, inter-entity contextual, and global sentence-level, from a shared biomedical encoder, and integrating them through a temperature-scaled gating mechanism regularized by view-dropout and an entropy constraint. Combined with focal loss to mitigate class imbalance, this design provides both a principled and effective solution for DDI classification.
Evaluated on the DDI-2013 benchmark under the official protocol, MG-ECF achieves a micro-F1 of 90.55% with BiomedBERT and 88.8% with BioBERT, outperforming the strongest prior PLM-based system (BioMCL-DDI [
17],
) by 2.7 percentage points. Ablation results indicate that all three representations contribute non-redundantly and that adaptive gating consistently improves over static fusion; because the same pattern holds on both backbones, the benefit of multi-granularity fusion does not appear to be backbone-specific.
The remaining limitation on the rare int class highlights the impact of data scarcity and suggests that future improvements will require data-centric strategies such as augmentation or distant supervision. Extending MG-ECF to other benchmarks, such as ChemProt and DrugProt, and combining it with knowledge-injection methods represent promising directions.
One architectural limitation deserves note: the inter-entity context bridge uses mean pooling over the token span between entities; for drug pairs separated by more than 60 tokens the pooled signal is diluted, and a learned span-level attention mechanism would be more robust for such cases. A full discussion of study-level limitations is provided in
Section 5.3.
Several research directions are identified for extending this work. Multi-task learning: jointly fine-tuning MG-ECF on DDI-2013 and ChemProt/BC7 DrugProt would provide richer supervision and regularize the encoder across relation types, potentially alleviating the int data scarcity problem through transfer from related interaction categories. Graph-augmented encoders: integrating drug knowledge-graph embeddings from DrugBank or ChEMBL into the entity view could supply pharmacological identity signals beyond surface text tokens, particularly for rare or proprietary drug names not well represented in PubMed pre-training corpora. Span-level attention: replacing mean pooling of the inter-entity context with a learned attention mechanism would address the long-range dilution failure mode and improve performance on sentences where the critical relational cue is distant from both entity mentions. Cross-lingual and clinical transfer: evaluating with a clinical or multilingual encoder backbone (ClinicalBERT, XLM-R-BioMed) on MIMIC-III and the Spanish DDI corpus would establish the breadth of the multi-granularity fusion principle across domains and languages.
Overall, this work suggests that explicitly modeling and adaptively fusing multi-granularity representations is a useful step toward more accurate and robust biomedical relation extraction systems. The interpretable gate weights, single-pass inference overhead, and compact parameter footprint indicate that MG-ECF is a promising research-stage candidate for practical deployment once validated on broader corpora. Beyond the benchmark setting, the architecture and results suggest potential applicability as a research-stage component within Internet-scale pharmacovigilance workflows, cloud-based drug safety monitoring services, and networked clinical decision support systems, pending real-world clinical evaluation and domain-shift validation. Scalable automated extraction of drug interactions from continuously growing online biomedical repositories is an increasingly critical operational need, and multi-granularity fusion provides an architecturally sound foundation for addressing it. The per-instance gate weights additionally offer a lightweight explainability signal, indicating which representational view (drug identity, relational context, or global discourse) drove each prediction, which may prove useful in clinical-safety auditing contexts as part of a broader human-in-the-loop review pipeline.