1. Introduction
Polypharmacy, the concurrent use of more than five active substances (a definition commonly used in the literature [
1]), has grown considerably in cardiology, oncology, psychiatry, and geriatrics [
2]. Although it enables holistic management of multimorbidity, each additional drug enlarges the combinatorial space of potential drug–drug interactions (DDIs). A 2023 claims-based analysis covering four European Union states found that 78% of adults taking more than seven prescriptions experienced at least one clinically actionable DDI [
3,
4].
Clinical decision support systems (CDSSs) are the primary safeguard against such risks, as they can screen medication orders in real time and generate context-aware alerts during regimen construction. Linking polypharmacy profiles to CDSS logic, therefore, directly influences patient safety and therapeutic efficacy [
5].
However, most commercial CDSSs treat DDIs as binary hazards [
6], ignoring their polarity. Interactions are simply flagged as present or absent, without indicating whether they enhance or diminish therapeutic effects. Distinguishing synergistic (therapeutically beneficial) from antagonistic (harmful or efficacy-reducing) interactions could help clinicians deliberately exploit positive synergies—such as β-lactam/β-lactamase inhibitor pairs in infectious disease—while mitigating adverse antagonism.
State-of-the-art biomedical language models such as BiomedBERT [
7] can interpret unstructured textual evidence in DrugBank, MEDLINE, or electronic health records (EHR), offering a scalable route to polarity-aware classification. The present study employs BiomedBERT as the backbone and fine-tunes it with low-rank adaptation (LoRA) [
8], allowing memory-constrained hospital servers to deploy the model without storing full-precision weight deltas.
This research bridges the intensifying demands of polypharmacy, the need for polarity-aware CDSS alerts, and a resource-efficient transformer pipeline that generates those alerts. A polarity-balanced seed corpus was assembled from three curated sources and used to fine-tune the BiomedBERT backbone via parameter-efficient LoRA adaptation. Although DrugBank provides millions of interaction descriptions, only a small subset carries polarity labels; these labels, drawn from DrugComb and the DrugBank antagonism file, were used to build the supervised training set. The trained model was then applied to DrugBank's structured interaction descriptions, which are well suited to polarity inference, to classify the remaining unlabeled drug–drug interaction sentences as synergistic or antagonistic. The proposed modular pipeline enables scalable polarity classification with minimal annotation effort and prepares the labeled data for downstream CDSS integration.
This article is organized as follows:
Section 2 reviews pertinent work on polarity-aware models, transformer fine-tuning, and DDI classification techniques. The methodology in
Section 3 explains the data sources, labeling system, and LoRA fine-tuning technique.
Section 4 outlines possible approaches to clinical integration. Comparative findings and experimental assessment are presented in
Section 5, while
Section 6 discusses key findings, limitations, and implications.
Section 7 presents the conclusions of the paper and directions for future research.
A preliminary version of this work was presented as an abstract at FMF-AI 2025, but not published as an article.
2. Literature Review
Recent rule-centric engines use metabolic ontologies and curated CYP-450 tables to detect pharmacokinetic clashes but still miss pharmacodynamic synergies. For example, in a seminal study [
9], drug database providers were found to update their interaction rule bases monthly, with each update vetted by clinical experts to maintain more than 90% precision for recently approved drugs. Noor and Assiri [
10] showed that a Tanimoto similarity threshold above 0.85 can recover over 60% of true DDIs, although recall remains limited.
Comprehensive surveys covering 2020–2024 report that graph neural networks and variational auto-encoders obtain AUROC between 0.82 and 0.86 on DrugBank pairs. Yan et al. [
11] employed a heterogeneous graph attention network (GAT) with chemical, gene-expression, and pathway edges. Liu et al. [
12] proposed the synergistic graph neural network (SynerGNet), a graph attention network tailored for predicting synergistic drug pairs in oncology. By integrating cell-line viability profiles and chemical descriptors, their model achieved a balanced accuracy of 84.1% on DrugComb-derived synergy data. However, it relies heavily on experimental omics inputs (e.g., gene expression and dose-response curves), limiting its direct applicability in real-world clinical decision support, where such data are unavailable at prescription time. Despite strong performance, the infrastructure demands and feature requirements of SynerGNet restrict its use to specialized research environments or pharmacological studies where rich experimental annotation is available.
Other studies have used domain-adapted language models such as BioBERT, PubMedBERT, and BiomedBERT, which have outperformed their general-domain counterparts by 6–10 percentage points in the F1 score (the harmonic mean of precision and recall) on relation-extraction benchmarks [
13]. Shankar et al. [
14] demonstrated that incorporating sentence-level attention explanations improved pharmacist trust ratings in a simulated medication-review task.
While domain-specific models like BioBERT, PubMedBERT, and BiomedBERT significantly improve biomedical language comprehension, they are not trained to recognize interaction polarity. These models learn general contextual embeddings but do not distinguish between synergistic and antagonistic effects unless explicitly fine-tuned on labeled polarity data. In this study, we use LoRA-based tuning to adapt BiomedBERT for this specific classification task.
Hu et al. [
8] benchmarked LoRA—which decomposes weight updates into low-rank matrices, often of rank 8–16, so that only 0.5–2% of parameters are trainable—against adapters and prefix-tuning on eight biomedical tasks. LoRA matched full fine-tuning within 0.3 F1 points while reducing video random-access memory (VRAM) usage by 12×.
In a different approach, Zhang et al. [
15] applied confidence-weighted pseudo-labels to MEDI-SPAN entries—a widely used commercial drug database containing structured information about drug interactions, dosages, and contraindications—observing a 5–7% macro-F1 gain with only 1000 human annotations. Unlike their method, which applied fixed confidence thresholds during training, our framework logs all predictions along with their confidence scores. This allows for flexible, rule-based filtering or manual review after inference, without discarding potentially informative cases prematurely.
Several earlier studies have improved predictive accuracy in DDI modeling by using molecular graph structures. One such method, for example, presented a graph neural network learning size-adaptive molecular substructures to capture chemically relevant interactions at several levels, enhancing pharmacological effect classification among compounds [
16]. This graph-based framework highlights the predictive utility of structural drug features, though it does not explicitly address sentence-level interpretability or clinical deployment.
Excessive system alerts and poor targeting still hinder the adoption of CDSS tools—nearly half of the alerts in outpatient care are ignored or overridden by clinicians [
17]. Intelligent systems with a computational understanding of medical context are therefore crucial. AI-enhanced CDSS prototypes that incorporate interaction polarity have reduced override rates to 28%, supporting the clinical relevance of polarity-aware classifiers.
In the medical field, it is especially important to understand how decisions are made. That is why explainability should be a core element in CDSS. Tanvir et al. [
18] proposed a heterogeneous attention network for drug–drug interaction prediction (HAN-DDI), a heterogeneous graph attention network trained on biomedical interaction graphs consisting of drugs, targets, enzymes, and side effects. The model achieved high performance in DDI prediction, reaching an F1 score of 95.18% for existing drugs and 82.87% for novel drugs, demonstrating strong generalization. However, the system depends on structured biomedical triples and does not operate directly on unstructured clinical narratives, limiting its utility in NLP-based settings.
Unlike previous models that relied on structured biomedical triples or omics data, our LoRA-BiomedBERT model operates directly on free-text interaction statements, making it more suitable for NLP-based clinical decision systems where such unstructured data is prevalent.
Despite recent advances, many DDI classification models either demand extensive experimental inputs or fail to process the real-world textual descriptions found in clinical databases, leaving academic prototypes disconnected from accessible tools. Our work seeks to close this gap with a lightweight, polarity-aware model that classifies DDIs directly from raw narrative data. A further aim of the research was to offer a scalable solution for real-time decision support in healthcare settings where structured data is often limited. The proposed technique balances accuracy against practicality.
Table 1 summarizes representative approaches in DDI prediction, highlighting the input types, architectures, and whether polarity awareness was supported. Most prior works focused on interaction presence rather than directional classification, which this study addresses explicitly.
3. Methodology
The experimental workflow follows a linear four-step process. It starts with label acquisition, followed by model fine-tuning, large-scale inference, and finally, ledger-guided refinement. The workflow is illustrated in
Figure 1.
3.1. Polarity-Balanced Seed Corpus
Three curated resources containing explicit polarity information were used, as follows: DrugComb-ASDCD synergism, a dataset of synergistic drug pairs experimentally validated in cancer studies; the comprehensive DrugComb synergy-score matrix (Bliss, Loewe, ZIP, HSA), which quantifies interaction effects across multiple models; and the DrugBank Antagonism file, which lists clinically documented cases where drug co-administration leads to reduced efficacy or adverse outcomes. The data was normalized by mapping drug names and synonyms to their canonical DrugBank identifiers using exact string matching and synonym resolution heuristics, then merged across sources. Rows were retained only when the synergy or antagonism score deviated by at least ±10% from the expected null effect (i.e., no interaction), or when an adverse outcome was documented in the literature. The 10% threshold was selected empirically based on prior use in DrugComb synergy scoring and to exclude minor numerical fluctuations around neutral interactions. It provides a practical filter that reduces noise while retaining biologically meaningful polarity shifts.
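The merge-and-filter step above can be sketched in a few lines of Python. This is a minimal illustration assuming a simplified row schema (`score`, `adverse_outcome_documented`); the field names and DrugBank-style identifiers are hypothetical, not the actual DrugComb/DrugBank schema:

```python
# Sketch of the seed-corpus filter: keep a row only if its synergy/antagonism
# score deviates by at least +/-10% from the expected null effect, or an
# adverse outcome is documented in the literature.

NULL_EFFECT = 0.0   # expected score under "no interaction"
THRESHOLD = 0.10    # +/-10% deviation required to retain a row

def keep_row(row):
    """Apply the polarity-significance filter described in the text."""
    if row.get("adverse_outcome_documented"):
        return True
    return abs(row["score"] - NULL_EFFECT) >= THRESHOLD

rows = [  # illustrative, already-normalized rows (canonical IDs, merged sources)
    {"pair": ("DB00316", "DB00945"), "score": 0.32,  "adverse_outcome_documented": False},
    {"pair": ("DB00316", "DB01050"), "score": 0.03,  "adverse_outcome_documented": False},
    {"pair": ("DB00945", "DB00563"), "score": -0.05, "adverse_outcome_documented": True},
]

seed = [r for r in rows if keep_row(r)]
# The 0.03 row falls inside the neutral band and is dropped; the documented
# adverse-outcome row is kept despite its small score.
```

The second row illustrates why the threshold matters: small fluctuations around the null effect would otherwise inject noisy, effectively neutral pairs into the polarity labels.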
After this procedure, the resulting set was balanced, yielding 7436 synergistic and 7436 antagonistic interaction sentences (14,872 total; a 1:1 class ratio). These were formatted as short declarative statements (e.g., “Drug A increases the anticoagulant effect of Drug B”). Class balance is essential in classification problems, as imbalanced datasets can cause biased learning and worse generalization for the minority class; here, the balanced class distribution made neither intentional oversampling nor undersampling necessary. As shown in [
19], this helps reduce problems such as inflated accuracy at the expense of recall or model overfitting to the dominant class.
To evaluate generalization, the corpus was split into a training set (11,744 sentences), a held-out test set (2936 sentences), and a validation set (192 sentences), as displayed in
Table 2. No drug entity appears in more than one split, guaranteeing that the model must extrapolate polarity to unseen drugs rather than memorize drug-specific phrases.
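The drug-disjoint property of such a split can be verified with a lightweight check; the triples and DrugBank-style identifiers below are illustrative:

```python
# Verify that no drug identifier leaks between splits of
# (drug_a, drug_b, label) triples.

def entities(split):
    """Collect every drug ID mentioned in a split."""
    return {d for a, b, _ in split for d in (a, b)}

train = [("DB0001", "DB0002", "synergistic"),
         ("DB0003", "DB0004", "antagonistic")]
test  = [("DB0005", "DB0006", "synergistic")]

# An overlap here would mean the model could memorize drug-specific phrasing
# instead of extrapolating polarity to unseen drugs.
assert entities(train).isdisjoint(entities(test)), "drug leakage between splits!"
```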
The model was provided with the Interaction_Description field values, a set of natural-language sentences, which were broken into tokens with BiomedBERT’s default tokenizer while keeping the meaning of each sentence intact. Structured fields, such as DrugBank IDs and labels, were used only to assemble the datasets and track metadata.
3.2. LoRA Fine-Tuning of BiomedBERT
In the present work, the
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext model was fine-tuned using LoRA [
For each self-attention block, the original query and value projection matrices $W_q, W_v \in \mathbb{R}^{d \times d}$ are augmented by a trainable low-rank term $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, so that the adapted weights become as follows:
$$W' = W + BA.$$
With rank $r = 8$ and $d = 768$ (matching BiomedBERT’s hidden size), the two low-rank matrices $B$ and $A$ together introduce exactly $2 \times 768 \times 8 = 12{,}288$ additional parameters—only 0.01117% of the base model—while recovering much of the representational capacity of a full-rank update. The value $r = 8$ was selected based on prior LoRA studies and early empirical results indicating a favorable tradeoff between accuracy and parameter efficiency.
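The quoted adapter parameter count follows directly from the shapes of the low-rank matrices; a quick sanity check:

```python
# Adapter parameter count: B is (d x r) and A is (r x d), so a single
# adapted projection gains d*r + r*d trainable parameters.
d, r = 768, 8
extra_params = d * r + r * d
print(extra_params)  # 12288

# Share of a ~110M-parameter BiomedBERT base (approximate base size).
share = extra_params / 110_000_000 * 100
print(f"{share:.5f}%")  # matches the ~0.01117% figure in the text
```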
Training minimizes the cross-entropy objective
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \{\mathrm{syn},\, \mathrm{ant}\}} y_{i,c} \, \log \frac{\exp(z_{i,c})}{\sum_{c'} \exp(z_{i,c'})},$$
where $\mathrm{syn}$ and $\mathrm{ant}$ denote the synergistic and antagonistic classes, $y_{i,c}$ is the one-hot ground-truth indicator, and $z_{i,c}$ is the logit for sample $i$ and class $c$. The Adam optimizer with weight decay (AdamW) was used, set to the conventional defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. These values regulate the exponential decay rates for the moving averages of the gradient (first moment) and squared gradient (second moment), helping to stabilize convergence and prevent oscillations during training, and have been experimentally shown to suit transformer-based designs. The learning rate followed a 10% linear warm-up schedule, and early stopping after two epochs without improvement was employed to avoid overfitting.
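For concreteness, the two-class cross-entropy objective can be evaluated in a few lines of pure Python; the logits and labels below are illustrative, not drawn from the actual corpus:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(logits, labels):
    """Mean negative log-likelihood over a batch; labels index the true class."""
    total = 0.0
    for z, y in zip(logits, labels):
        total -= math.log(softmax(z)[y])
    return total / len(logits)

batch_logits = [[2.0, -1.0], [0.3, 0.8]]  # one row of class logits per sentence
batch_labels = [0, 1]                     # 0 = synergistic, 1 = antagonistic
loss = cross_entropy(batch_logits, batch_labels)
```

A confident correct prediction (first row) contributes a small loss term, while the weakly separated second row dominates the batch loss, which is exactly the gradient signal AdamW then uses to update the LoRA matrices.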
Training was performed on an NVIDIA RTX 3060 with 12 GB of VRAM. At batch size 8 and sequence length 256, convergence was reached within 6–7 epochs over ∼3.5 h. With only 12,288 trainable parameters, the LoRA adapter allows practical fine-tuning on consumer hardware. The resulting adapter weighs <10 MB and can be merged into the base checkpoint on-the-fly or kept separate for rapid version control.
While DrugBank provides millions of interaction descriptions, only the small polarity-labeled subset identified from DrugComb and the DrugBank antagonism file was used to build the supervised training set.
After fine-tuning, the adapter-augmented model was used to label all remaining interaction sentences in DrugBank v5.1.10. Prior to inference, every pair present in the seed corpus was removed. This ensured a strict separation between training and application data. Roughly 1.5 million sentences were streamed through the model in batches of 256; predictions, confidences, and checkpoint hashes were inserted into a resumable SQLite ledger. This exhaustive logging allows downstream rules or pharmacists to review low-confidence pairs without re-executing the classifier.
3.3. Ledger-Guided Refinement
No prediction is discarded, even uncertain ones. Every output is logged in a lightweight SQLite file that acts as a ledger, storing the label alongside the confidence score, the source sentence, and basic context such as model version and timestamp. This makes it easy to track what the model predicted, on what evidence, and with what certainty, so suspicious or borderline cases can be reviewed later or used to refine future versions. The ledger thus allows post hoc decisions—e.g., identifying low-confidence samples for human review or retraining. Unlike hard-threshold filtering, this confidence-aware logging enables rules-based feedback (e.g., ATC class contradictions) and pharmacist-guided refinement without data loss. It also supports resumable inference and persistent tracking of model decisions for each processed interaction.
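A minimal sketch of such a confidence-aware ledger using Python's standard `sqlite3` module follows; the table layout, column names, and example rows are assumptions for illustration, not the authors' actual schema:

```python
import datetime
import hashlib
import sqlite3

# In-memory database for the sketch; the paper's ledger is an on-disk,
# resumable SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ledger (
        pair_id     TEXT PRIMARY KEY,
        sentence    TEXT,
        label       TEXT,
        confidence  REAL,
        model_hash  TEXT,
        logged_at   TEXT
    )
""")

def log_prediction(pair_id, sentence, label, confidence, checkpoint):
    """INSERT OR REPLACE keeps the ledger resumable across interrupted runs."""
    conn.execute(
        "INSERT OR REPLACE INTO ledger VALUES (?, ?, ?, ?, ?, ?)",
        (pair_id, sentence, label, confidence,
         hashlib.sha256(checkpoint).hexdigest()[:12],          # checkpoint hash
         datetime.datetime.now(datetime.timezone.utc).isoformat()),
    )

log_prediction("DB0001|DB0002", "Drug A increases the effect of Drug B.",
               "synergistic", 0.97, b"adapter-v1")
log_prediction("DB0003|DB0004", "Drug C may affect the activity of Drug D.",
               "antagonistic", 0.51, b"adapter-v1")
conn.commit()

# Post hoc, rule-based filtering: pull low-confidence rows for pharmacist review
# without re-executing the classifier.
review = conn.execute(
    "SELECT pair_id, confidence FROM ledger WHERE confidence < 0.6").fetchall()
```

The `INSERT OR REPLACE` on a primary-keyed pair identifier is what makes re-running an interrupted inference pass idempotent: already-processed pairs are simply overwritten with identical rows.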
5. Experimental Evaluation
Experimental results show that applying LoRA tuning to BiomedBERT yields a strong gain in polarity classification without the heavy cost of full model retraining. After eight training epochs, the fine-tuned model correctly classified 2347 out of 2936 test samples—an accuracy of roughly 79.96%.
Figure 2 provides a multi-angle comparison. In
Figure 2a,b, the confusion matrices show that while the baseline model struggles with both classes, LoRA recovers a more balanced and accurate classification.
Figure 2c tracks clear gains across accuracy, F1, precision, and recall.
Figure 2d further confirms that LoRA improves both synergistic and antagonistic recognition.
ROC curves in
Figure 2e show a marked improvement in the area under the curve (AUC), rising from 0.449/0.644 (baseline) to 0.864/0.866 (LoRA), with synergistic interactions treated as the positive class during evaluation. The model thus separates the two classes reliably across a wide range of decision thresholds (AUC ≈ 0.865). In practice this matters because clinical decisions often depend on distinguishing combinations that are reliably beneficial from those that are neutral or harmful.
However, since precision–recall (PR) curves are asymmetric, a complete view of model performance requires plotting an additional curve with the complementary class treated as positive. This is achieved by inverting the class labels and their associated probabilities. The multi-curve evaluation in
Figure 2f highlights recall–precision trade-offs between classes, exposes mild imbalance, and reveals asymmetries that would be hidden in scalar metrics like F1 or accuracy. In clinical applications, a higher PR curve is desirable when false positives (e.g., predicting synergy when not present) would lead to risky combinations. Therefore, this analysis helps prioritize recall or precision depending on the clinical tolerance for risk.
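The inversion described above amounts to flipping each binary label and using the complementary probability as the positive-class score. A small self-contained sketch (with illustrative labels and probabilities) shows how a complementary PR point is computed at one threshold:

```python
labels = [1, 0, 1, 1, 0]            # 1 = original positive class
probs  = [0.9, 0.4, 0.7, 0.2, 0.1]  # model's probability of that class

# Invert labels and probabilities so the other class becomes positive.
inv_labels = [1 - y for y in labels]
inv_probs  = [1.0 - p for p in probs]

def precision_recall_at(th, ys, ps):
    """Precision/recall treating scores >= th as positive predictions."""
    tp = sum(1 for y, p in zip(ys, ps) if y == 1 and p >= th)
    fp = sum(1 for y, p in zip(ys, ps) if y == 0 and p >= th)
    fn = sum(1 for y, p in zip(ys, ps) if y == 1 and p < th)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

prec, rec = precision_recall_at(0.5, inv_labels, inv_probs)
```

Sweeping the threshold `th` over [0, 1] and collecting these (recall, precision) points yields the second PR curve; because precision depends on the class prevalence, the two curves are not mirror images of each other, which is exactly why both are needed.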
We trained and tested the model on both the synergistic and antagonistic polarity classes, using balanced data.
Figure 3 summarizes the dataset characteristics: the class distribution, the split sizes, and the input-sentence lengths. Accuracy, precision, recall, F1 score, and the confusion matrix are reported for both classes; no class was excluded during training or analysis.
To obtain a point of reference, we also ran the base BiomedBERT model without any fine-tuning. We used the same test set and did not apply LoRA or any task-specific adjustments. The results were clearly weaker, especially for synergy cases, where the model often missed the correct label; see
Table 3. This makes sense, since the original model was not trained to recognize polarity. These baseline scores helped us gauge how much the tuning process actually improves the outcome.
In more concrete terms, the model recognizes synergistic statements roughly nine times out of ten (recall ≈ 89.59%) and still identifies around 70.31% of antagonistic ones. These results represent solid performance for sentence-level DDI extraction, particularly given that only 12,288 parameters (a small fraction of the base model) were updated during training.
Several follow-up experiments—varying LoRA rank, α-scaling, dropout, and learning rate—exhibited the same pattern: accuracy fluctuated within a narrow ±2% band, while memory consumption and training time remained essentially constant. This stability suggests that performance is largely driven by the pretrained biomedical priors of the backbone rather than by fine-tuned hyperparameter settings.
Misclassifications tend to cluster around vague formulations such as “Drug A may affect the activity of Drug B,” where the sentence provides no explicit indication of benefit or harm. It is expected that recall for the Antagonistic class will improve once such borderline examples are incorporated into the next pseudo-labeling round, thereby exposing the model to a broader variety of negative cues.
We also checked how the model performs on completely new drug combinations—the ones it never saw during training.
Table 4 shows a side-by-side comparison between the original BiomedBERT and the LoRA-tuned version. LoRA handled the unfamiliar data much better, with substantial improvements in F1 and precision. This suggests the model does not merely memorize but generalizes to new cases.
To better understand how the model performs on drugs it has not seen before,
Figure 4 presents a side-by-side comparison of the baseline and LoRA-enhanced versions. In
Figure 4a,b, we see that the original model fails to detect any synergistic interactions, whereas the LoRA variant manages to correctly identify both types with strong precision. The bar chart in
Figure 4c breaks down the metric gains, showing the biggest improvements in F1 score and precision. ROC and AUC scores for each class are shown in
Figure 4d,e, and they indicate that the model can separate synergistic and antagonistic cases with high reliability. Lastly, the curves in
Figure 4f suggest that LoRA remains confident across a wide range of thresholds, which is useful when making real-world decisions based on prediction certainty.
The results validate LoRA as a pragmatic compromise between performance and deployability. Achieving almost 86% accuracy on unseen drugs with a model trained on only 11,744 examples (alongside 2936 test and 192 validation examples) provides a credible foundation for integration into live CDSS modules. Meanwhile, the adapter’s small footprint ensures ease of installation and version control.
7. Conclusions
The present study describes a polarity-aware classification framework based on LoRA-tuned BiomedBERT that distinguishes synergistic from antagonistic DDIs using sentence-level input. The system was trained on polarity-labeled data derived from DrugComb and DrugBank and evaluated on a held-out set sharing no drug entities with the training data.
This method looks promising for clinical support tools, especially where structured labels are unavailable but sentence-level interaction descriptions exist. It is efficient enough to run in resource-constrained environments and does not require full model retraining. However, the model has not yet been tested on clinical data such as patient records or hospital notes; at present, it operates only on labeled sentences from public datasets and must be evaluated in more realistic settings before practical use.
Future work will expand the label set, which currently covers only synergism and antagonism, and apply the model to other biomedical text sources to assess how well it generalizes beyond this setup.